Generative AI Scraping and Data Theft

Data scraping, or the act of automatically extracting information from publicly available sources, is a widely used tool in the field of artificial intelligence. AI models, such as OpenAI’s ChatGPT or Meta’s LLaMA, use vast amounts of data to learn and enhance their capabilities. This process is critical for training software and producing consistent and understandable outputs.

However, when this data is copyrighted, the issue becomes much more complex.

The Charge: Copyright Infringement by OpenAI

Canadian novelist Mona Awad and U.S. writer Paul Tremblay have initiated legal action against OpenAI, claiming that their copyrighted works were used without permission to train the well-known and widely used artificial intelligence model, ChatGPT.

Awad and Tremblay’s claims rest on ChatGPT’s ability to generate “very accurate summaries” of their books. According to the authors, summaries this accurate suggest that their works were knowingly used in ChatGPT’s training, which makes the summaries crucial evidence in the case.

The complaint alleges direct and vicarious copyright infringement, violations of the Digital Millennium Copyright Act, a U.S. federal law, and related claims under California law, including unfair competition.

The authors’ allegations, if legally founded, could have significant implications for OpenAI and the broader content creation community.

The Potential Repercussions for OpenAI and the Artificial Intelligence Sector

Attorney Matthew Butterick has launched a comprehensive campaign against generative AI models and created websites to disseminate information about legal actions initiated against AI models. The issue is not confined to OpenAI but extends to other companies using data-intensive machine learning systems.

In the case against Stable Diffusion, the plaintiffs allege that five billion images were used without the consent of the original artists, potentially causing them significant financial losses. Stability AI’s CEO Emad Mostaque predicts that future AI models will need to be fully licensed to avoid legal challenges.

The Meta Platforms Case: LLaMA and Alleged Copyright Violations

Similar dynamics are at play with Meta Platforms, Inc. (Meta). Attorney Joseph Saveri’s law firm has filed a class action in the U.S. District Court for the Northern District of California challenging Meta’s LLaMA model. The complaint alleges that Meta infringed copyright by scraping vast amounts of text without the authors’ consent.

Saveri emphasizes the need to protect the rights of artists against illicit theft and fraud, stating that AI products like LLaMA may eliminate viable careers for authors.

Possible Outcomes Between Licensing Agreements and Fair Use Interpretations

Several variables could affect the outcomes of these lawsuits and have consequences for the entire artificial intelligence industry.

For now, entering into specific licensing agreements remains the most plausible solution to reconcile the interests of all parties involved. The balance appears to favor big tech, pending appropriate legislation and the realization that the world of artistic and creative works is undergoing a profound and irreversible transformation.

Avv. Alfredo Esposito
Featured on Agenda Digitale
