Generative AI and Copyright: Busting Prevalent Myths, Revealing Truths
Main takeaways
- Europe can lead on AI innovation with all its talent and research, but not without a more nuanced copyright debate, which continues to be dominated by misconceptions.
- Copyright protects original expressions fixed in works, but not the underlying ideas.
- There is no need for extra rules. The EU already has all the necessary regulatory tools to protect copyright in the age of artificial intelligence.
The emergence of generative artificial intelligence (AI) presents a major opportunity for the European Union to regain its competitive edge and shape the future of technology – building on Europe’s wealth of talent, top-tier educational and research institutions, and access to computing power.
However, successfully navigating this landscape requires us to understand the ongoing debate over the use of copyright-protected content in the training of AI models, going beyond the simplifications that tend to dominate these discussions in the EU. Let’s take a look at several common misconceptions, as well as more accurate ways of looking at this debate and its far-reaching societal consequences.
1. “Generative AI models contain copies of their training data” – Myth or truth?
Myth. Generative AI systems do not store compressed or bitwise copies of the data they have been trained on within the actual models. Instead, they utilise mathematical techniques to learn patterns and concepts as numerical parameters or weights. For example, when trained on text data, these models adjust parameters to reflect probabilities for certain word combinations, enabling them to generate coherent responses.
Like someone who has read many books on a particular subject and subsequently writes a book with their own take on that topic, generative AI systems do not copy their sources. They learn patterns from them and are therefore able to generate original content.
Exposure to certain content during the training phase can influence the output generated at a later stage, because the result produced is a statistical probability. For example, if a model has been exposed to tens of thousands of images of cats during training, it can learn what the characteristics of a ‘cat’ are and is therefore more likely to be able to generate a picture of a cat accurately when asked in the output stage.
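To make the idea of "patterns stored as numerical parameters" concrete, here is a deliberately tiny sketch in Python. It is an illustration under strong simplifying assumptions, not a description of how any real generative AI system is built: production models use neural networks with billions of weights, whereas this toy only counts which word tends to follow which. The function names and the three-sentence corpus are hypothetical and chosen purely for the example.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Estimate P(next word | current word) from word-pair counts."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        # Count every adjacent word pair in the sentence.
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1

    # Turn raw counts into probabilities: these numbers are the "parameters".
    model = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        model[current] = {word: n / total for word, n in followers.items()}
    return model


def most_likely_next(model, word):
    """Return the most probable follower of `word`, or None if it was never seen."""
    options = model.get(word.lower())
    if not options:
        return None
    return max(options, key=options.get)


# Hypothetical toy training data.
toy_corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]

model = train_bigram_model(toy_corpus)
print(model["the"])                    # probabilities for words that follow "the"
print(most_likely_next(model, "the"))  # the single most likely follower
```

What the toy keeps after training is a table of probabilities (for instance, how often "cat" follows "the"), and exposure to more sentences about cats would shift those probabilities, which mirrors the point made above about training influencing later output.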
2. “Knowledge, facts, ideas, and information are free-flowing and cannot be copyrighted” – Myth or truth?
Truth. Fundamental rights and legal frameworks – including the Universal Declaration of Human Rights, the Convention for the Protection of Human Rights and Fundamental Freedoms, and the Charter of Fundamental Rights of the European Union – uphold everyone’s right to access and disseminate information.
Facts, ideas, knowledge, and information simply cannot be copyrighted. It’s essential to keep safeguarding this key principle, which has also been reflected in copyright laws, amid the ongoing debates about AI development here in Europe.
3. “Copyright law protects data – end of story” – Myth or truth?
Myth. Copyright law safeguards original expressions fixed in a tangible medium but not the underlying ideas, facts, or information. In other words, you are not allowed to exploit someone else’s copyright-protected work without their permission, but you can learn as much from it as possible. Making this distinction is crucial to prevent overreach of copyright protection, as well as to uphold freedom of expression and information.
The EU’s recent Copyright Directive and AI Act also recognise this. The delicate balance struck by the Directive’s text-and-data-mining exceptions should not be weakened: doing so would invite interpretations contrary to the spirit of the law and risk unintended consequences for fundamental rights.
4. “Every government agrees that copyright holders should be able to say no to AI model training” – Myth or truth?
Myth. Among the major jurisdictions in the AI race, only the European Union has so far granted rightsholders a legal entitlement to opt out of text and data mining (TDM) for training purposes. Other countries, such as the United States and Japan (as well as Singapore, South Korea, Malaysia, Israel, and Taiwan), have exceptions in place that help to promote innovation and data accessibility without this opt-out entitlement.
This divergence between the EU and the rest of the world creates legal uncertainty and weakens the competitiveness of Europe’s AI industry, while also reducing the availability of the latest innovations for European businesses and users. Balancing rightsholders’ interests with the latest technological advancements has always been complex, requiring nuanced approaches. International collaboration and alignment will therefore be key to reducing the uncertainty currently surrounding the use of data, including copyright-protected content.
5. “Rightsholders have no way to prevent their data from being included in training sets” – Myth or truth?
Myth. Rightsholders can rely on the universally accessible and robust robots.txt protocol to prevent web crawlers from ingesting their content. While some rightsholders may encounter technical difficulties regarding the level of granularity of the protocol, the technology and creative industries can work together to develop more targeted solutions and standards.
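As a rough illustration of how a robots.txt opt-out works in practice, the sketch below uses Python’s standard-library urllib.robotparser module to evaluate a small set of rules. The crawler names "ExampleAIBot" and "ExampleSearchBot" and the example.com URL are hypothetical placeholders; real crawlers publish their own user-agent tokens, and whether a given crawler honours the file is up to its operator.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) AI training crawler,
# allow every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt rules from a string instead of fetching them over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
article_url = "https://example.com/articles/my-story"  # placeholder URL

# A well-behaved crawler runs this check before requesting the page.
ai_bot_allowed = parser.can_fetch("ExampleAIBot", article_url)
search_bot_allowed = parser.can_fetch("ExampleSearchBot", article_url)

print("ExampleAIBot allowed:", ai_bot_allowed)
print("ExampleSearchBot allowed:", search_bot_allowed)
```

The rules here apply to an entire site per crawler, which also hints at the granularity limits mentioned above: expressing "this work may be indexed but not used for training" takes a separate user-agent entry for each crawler concerned.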
Major technology companies already provide more sophisticated tools to rightsholders that want to have their data excluded from training sets. It goes without saying that finding suitable technical solutions is in the interest of both the technology and creative sectors.
Additionally, it’s important to clarify that rightsholders’ use of opt-out should not impede TDM activities otherwise allowed by law. In fact – debunking yet another prevalent myth – rightsholders are not allowed to oppose TDM in all cases, for example if it’s done for the purpose of research or accessibility. Addressing these challenges in a cross-sectoral way is vital to ensuring legal compliance and fostering innovation.
6. “Rightsholders only want to license their content to AI companies” – Myth or truth?
Myth. While some rightsholders offer licences to use their works for AI development, many others are unprepared for AI licensing, both practically and conceptually. Recent figures show that the majority of websites and rightsholders do not block access to their data for AI training, which indicates that they do not see any need to opt out. Surveys of content creators also show that their views on the use of works for AI training are not as black and white as some would have you think – creators’ preferences are much more nuanced.
In fact, slowing down generative AI innovation through cumbersome licensing of AI training data is likely to negatively impact the media and creative sectors, which are among the first to really benefit from this type of innovation. Many media companies and professionals, for instance, are already exploring how to use AI for their own content creation. Balancing the interests of the creative sector with technological advancements and fundamental rights will remain essential to foster this kind of innovation.
Conclusion
Navigating the intersection of generative AI and copyright requires everyone involved to reconcile competing interests while upholding fundamental freedoms and key principles. It’s crucial to address misconceptions, clarify legal frameworks, and foster collaboration between sectors to ensure equitable and sustainable AI development in the EU.
Europe already has all the necessary regulatory tools to protect copyright in the age of AI. Stakeholders just need to work together to make sure everyone can harness the full potential of generative AI.
An abridged version of this article first appeared in German in Tagesspiegel Background Digitalisierung & KI – Generative KI und das Urheberrecht: Mythen und Fakten.