The rise of generative artificial intelligence has triggered a debate about the appropriate protections for copyrighted data. This column examines the economic incentives and social welfare implications of different copyright approaches. For ‘small’ AI models (trained on an identifiable corpus of content), giving content owners full copyright protection leads to higher investments in both content quality and AI model quality. For larger models, there is a trade-off between the benefits of training data access and the risk of harm to content owners. Policymakers should weigh these factors and craft copyright rules that promote both flourishing creative ecosystems and cutting-edge artificial intelligence.
In recent years, powerful generative artificial intelligence (AI) models have emerged, including large language models like ChatGPT, which can produce human-like text outputs from prompts, and image generation models like DALL-E, which creates images from text descriptions. While there is a broad debate regarding various economic issues associated with such models, from the environmental impact of power consumption (Abeliansky et al. 2023) to more standard liability issues (Kretschmer et al. 2023), a more recent flashpoint for discussion surrounds copyright protections. This is because the training data used to build these AI models often include copyrighted content like books, articles, and online media. Should AI companies have to license and pay for the copyrighted data used to train their models? Or does such usage fall under fair use provisions? (See Samuelson 2023 for an excellent overview.)
The issue gained new prominence when a leading content provider filed a lawsuit against a leading AI provider. In 2023, the New York Times (NYT) filed a lawsuit alleging that OpenAI had used the newspaper’s copyrighted content to train its GPT large language models without permission. As evidence, the NYT demonstrated that both ChatGPT (created by OpenAI) and Bing Chat (which licenses GPT from OpenAI) were able to reproduce some NYT articles nearly verbatim when prompted in certain ways.
The NYT argued this showed the models were trained on its copyrighted material. It asked the court to prevent OpenAI from using models trained on NYT content and requested statutory damages for the alleged copyright infringement.
In response, OpenAI claimed it had not intentionally trained its models on NYT articles. Instead, it argued the verbatim reproduction of article snippets was likely due to ‘data contamination’, where some of the NYT content was unintentionally included in its training data after being published widely online. OpenAI said this verbatim regurgitation was an unintended glitch, not evidence of deliberate copying.
However, the NYT argued that even if the usage was unintentional, the ability to reproduce copyrighted NYT content meant the models were inappropriately deriving value from the newspaper’s work. The NYT said this ‘leakage’ of its content via AI chatbots could substitute for reading the original articles, harming its subscription business.
The case highlights thorny issues around AI and copyright. Even if companies aren’t deliberately copying protected content, training huge models on web-scraped data risks sweeping in copyrighted works. When an AI regurgitates snippets of those works, verbatim or in paraphrased form, has it infringed the copyright? How should the law handle unintentional but potentially harmful inclusions of copyrighted content? The OpenAI/NYT dispute raises these questions, but so far, there are no clear answers.
This is a classic conundrum. On one side, AI companies argue that individually licensing the huge volume of content used in training would be impractically costly. On the other, content owners and creators argue they should be compensated when their work is used and that uncompensated usage by AI could undermine their incentives and business models.
One argument made by those who contend that generative AI models do not infringe copyright protections is that humans are not regarded as infringing in analogous situations.
If a person reads a book and writes a summary of it, or if a superfan publishes insightful commentary about a TV show, those activities would generally not be considered copyright infringement under current law and are usually permissible.
However, the legal status is much murkier for the equivalent AI systems. If a company scrapes a large volume of copyrighted books to train an AI model or ingests full TV show scripts to build a chatbot, have they infringed on the rights of the copyright holders? There is no clear consensus currently.
These scenarios drive home the point that AI is raising challenging new questions around the boundaries of intellectual property law. Policymakers and legal scholars are grappling with how copyright frameworks built for a pre-AI world should evolve for a future in which machine learning models are trained on vast corpora of creative works. The human analogues provide a useful jumping-off point for working through the right balance between protecting the rights of copyright owners and enabling beneficial applications of artificial intelligence.
Of course, a deeper question that needs to be answered is why human-generated outcomes are not considered infringing in the first place. My research examines this question to understand the conditions under which an economist would want to protect original content creators whose content is used in training generative AI.
In Gans (2024), I develop an economic model of generative AI training and content creation under various copyright regimes to examine the economic incentives and social welfare implications of different approaches. A key factor is whether an AI model is ‘small’, using an identifiable corpus of content, or ‘large’, trained on such a huge dataset that the provenance of individual pieces of content cannot be determined. Examples of large models include the large language models of OpenAI, Google, Anthropic, Meta, and Mistral, and image generators such as Midjourney, Stable Diffusion, and DALL-E. Smaller models include most other AI models trained on particular datasets: some are generative AI models built for specialised tasks, while others are otherwise large models given a specialised application (using Retrieval Augmented Generation, or RAG) that relies on a specific corpus of data.
For small AI models, the model finds that giving content owners full copyright protection, rather than no protection, leads to higher investments in both content quality and AI model quality, and to superior overall social welfare. Intuitively, copyright protection maintains strong incentives for creators to invest in high-quality original content. AI companies can then negotiate licenses to use that content, leading to better AI while still compensating creators.
The picture is more ambiguous for large AI models trained on huge datasets. Here, identifying and licensing individual content inputs simply isn’t feasible. Compared to full copyright, a ‘no copyright’ regime improves AI model quality by allowing unrestricted training. However, it risks undermining incentives for content creation if it leads to significant economic harm to creators and copyright owners.
Ultimately, for large AI models, whether a permissive fair use approach dominates full copyright protection depends on the relative magnitudes of 1) the value of training data in improving AI model quality and 2) the expected economic harm to content owners from uncompensated usage. If the AI benefits are large compared to the harm to content creators, a fair-use approach produces better social welfare outcomes.
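To make this comparison concrete, here is a minimal, stylised sketch in Python. It is not the formal model in Gans (2024); the welfare functions, parameter names, and numbers are all illustrative assumptions, chosen only to show how the ranking turns on whether the AI-quality gain from unrestricted training exceeds the expected harm to content owners.

```python
# Stylised welfare comparison for a large AI model under two copyright regimes.
# All functions and numbers below are illustrative assumptions, not the formal model.

def welfare_full_copyright(ai_quality_licensed, content_quality):
    # Full copyright: training access is restricted or must be licensed,
    # but creator incentives -- and hence content quality -- are preserved.
    return ai_quality_licensed + content_quality

def welfare_fair_use(ai_quality_unrestricted, content_quality, expected_harm):
    # Fair use: unrestricted training raises AI quality, but uncompensated usage
    # imposes an expected economic harm on content owners.
    return ai_quality_unrestricted + content_quality - expected_harm

# Hypothetical magnitudes of the two forces in the trade-off.
ai_gain_from_open_training = 10.0   # value of training data in improving AI model quality
harm_to_content_owners = 4.0        # expected economic harm from uncompensated usage

w_copyright = welfare_full_copyright(ai_quality_licensed=5.0, content_quality=8.0)
w_fair_use = welfare_fair_use(
    ai_quality_unrestricted=5.0 + ai_gain_from_open_training,
    content_quality=8.0,
    expected_harm=harm_to_content_owners,
)

# Fair use yields higher welfare only when the AI benefit exceeds the harm.
print("fair use preferred:", w_fair_use > w_copyright)
```

With these illustrative numbers the AI benefit dominates, so fair use is preferred; shrinking the quality gain or raising the harm reverses the ranking.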
An alternative ‘ex-post’ licensing mechanism is a potential solution to these stark trade-offs. In this ex-post fair use-like mechanism, AI companies could freely use copyrighted content for training, but content owners would have the option to sue for lost profits if they experience significant economic harm from an AI model utilising their content. Compared to a blanket fair-use approach, this maintains stronger incentives for content creation. At the same time, it still enables free use of content as training data, which benefits AI models, without posing substantial risks to content owners.
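A rough sketch of the payoffs under such an ex-post mechanism is below; again, the function names and numbers are hypothetical illustrations rather than the formal model. The point is that content owners can recover lost profits after the fact, so their creation incentives are roughly preserved, while the AI company trains freely but faces a contingent liability.

```python
# Stylised payoffs under an ex-post ('train freely, pay damages if harmed') mechanism.
# Function names and numbers are hypothetical illustrations, not the formal model.

def content_owner_payoff(baseline_profit, realised_harm, can_sue=True):
    # The owner bears any harm from AI usage, but can sue for lost profits when
    # harmed, which (roughly) restores the baseline return to creating content.
    if can_sue and realised_harm > 0:
        return baseline_profit              # lost profits compensated ex post
    return baseline_profit - realised_harm

def ai_company_payoff(value_of_training_data, realised_harm, can_sue=True):
    # The AI company uses the content freely for training, but pays damages equal
    # to the realised harm if the content owner sues.
    damages = realised_harm if can_sue else 0.0
    return value_of_training_data - damages

# Example: the content is worth 10 as training data and causes 3 in lost profits.
print(content_owner_payoff(baseline_profit=8.0, realised_harm=3.0))      # 8.0 -> creation incentives preserved
print(ai_company_payoff(value_of_training_data=10.0, realised_harm=3.0))  # 7.0 -> training still worthwhile
```

The design choice is that damages are paid only when harm is realised, so the AI company internalises the harm it causes without having to negotiate licences for every piece of content up front.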
In conclusion, this research highlights that the optimal copyright approach for generative AI depends on the nature of the AI models and training data involved. For smaller, targeted AI systems, robust copyright protection and licensing are likely to produce the best outcomes by ensuring both content owners and AI companies face strong incentives to invest in and utilise high-quality content.
For large-scale AI models trained on huge datasets, policymakers will need to weigh the benefits of training data access for AI progress against risks of harm to content owners. Fair use exemptions are likely to be optimal when the AI benefits substantially outweigh the harms to content owners. Alternatively, novel licensing mechanisms that allow free usage but preserve ‘backstop’ protections for content owners could provide a beneficial middle ground.
While much remains uncertain in this fast-moving domain, this economic framework provides a useful starting point for thinking through the incentives and trade-offs involved with different copyright approaches. Policymakers can craft copyright rules that promote both flourishing creative ecosystems and cutting-edge artificial intelligence by carefully considering the nature of the AI models and the relative benefits and risks for content creators and AI progress.