Three non-fiction authors — Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson — sued Anthropic for illegally using their books to train the Claude model. While the case is still ongoing, Judge William Alsup issued a partial ruling addressing key aspects of the dispute.
Reproduction in the training process is fair use
This judicial opinion doesn’t end the proceedings, but it’s worth noting because it is partially favorable to AI companies, though with caveats. The court recognized that machine learning from legally obtained books can meet the criteria for fair use. It also clearly emphasized that copying content from pirated sources violates copyright law. While the court hasn’t yet decided whether responses generated by the model could constitute secondary infringement, it suggested that appropriate output filters might be enough to prevent it.
In his opinion, Judge Alsup noted that training an LLM on books isn’t aimed at reproducing them but at creating new value: the ability to generate diverse responses.
“The model trained upon works, not to race ahead and replicate or supplant them, but to create something different” – Judge William Alsup
The fact that some parts might be “remembered” by the model doesn’t automatically mean there’s an infringement, as long as they aren’t mechanically reproduced in the responses.
This reasoning is similar to the argument used by the US Court of Appeals for the Second Circuit in the Google Books case, where processing books for search and indexing was recognized as transformative use.
Scanning paper copies and illegal sources
The next issue concerned the digitization of books. Anthropic defended itself by saying it only scanned legally acquired paper copies to make processing easier. The court found no illegal redistribution, merely a conversion of format.
On the other hand, the court had no doubt that sourcing content from pirate repositories like Books3 or Library Genesis infringes copyright. The explanation that it was a “research library” simply wasn’t enough. This is a relevant message not just for Anthropic but also for other companies in the industry, such as Meta, which also trained models on Books3.
It’s not a photocopy, it’s a creative tool
Is the court’s position surprising? Not really. Models aren’t designed to copy and store works, but to learn structures and correlations. However, generating responses in an author’s style could still be seen as violating their rights, and this issue isn’t fully settled yet.
For the AI industry, this court opinion sends a clear message: use legally acquired sources and apply output filters. That’s all it takes to stay on the right side of the law.
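What might such an output filter look like? Anthropic’s actual implementation is not public, so the sketch below is purely illustrative: a naive check that flags a generated response if it shares a long contiguous word sequence with a protected text. The function names, the n-gram length, and the sample texts are all assumptions made for the example.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def looks_like_verbatim_copy(response: str, protected_text: str, n: int = 8) -> bool:
    """Naive filter: flag a response that shares any n-word sequence
    with a protected work. A production system would use fuzzy matching
    against an indexed corpus, not a linear scan of one text."""
    return bool(ngrams(response, n) & ngrams(protected_text, n))


# Hypothetical usage: withhold responses that reproduce a book passage.
book = "It was the best of times, it was the worst of times, it was the age of wisdom"
candidate = "As Dickens wrote, it was the best of times, it was the worst of times indeed."

if looks_like_verbatim_copy(candidate, book):
    print("Response withheld: possible verbatim reproduction.")
```

The point of the sketch is conceptual: the filter operates on the model’s output, not on its weights, which matches the court’s suggestion that memorization alone isn’t the problem, reproduction in responses is.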
A judge who understands technology
It’s worth noting who issued this partial ruling: William Alsup, known from the Oracle v. Google case, where he ruled that Google’s use of the Java APIs fell within the bounds of fair use. This is a judge who has repeatedly shown that he understands the complexity of computer technology.