Australian copyright law is not fit for training AI

With a wave of copyright infringement lawsuits being filed against OpenAI, Anthropic and other major generative AI providers for training models on proprietary data that is often regurgitated verbatim to users, there has been a corresponding surge in commentary on the treatment of AI under, in particular, US and EU copyright law. In this article, I hope to shed some much-needed light on the status of AI training under the copyright law of an often-forgotten part of the ‘rule-of-law world’, Australia. I write from the perspective of a data scientist who also holds an Australian law degree and who works in the legal tech industry.

There are two key issues that Australian copyright law raises for training AI. The first pertains to the procurement of copyrighted training data. The second concerns whether the very act of training AI on copyrighted data (as distinct from procuring data for that purpose) constitutes copyright infringement.

In brief, I find that Australian copyright law is not fit for purpose when it comes to either the procurement of copyrighted training data or the training of AI on copyrighted data, at least not if we want to see Australians training their own foundational models and Australian businesses disrupting Silicon Valley’s stronghold on AI innovation. While finding a solution to the problem is beyond the scope of this article, I urge policymakers to quickly adopt a new model for ensuring creators are adequately compensated while also enabling rather than stifling innovation, lest our burgeoning AI industry fall irreparably behind our international counterparts.

Prima facie, it is an infringement to reproduce a copyrighted work without a licence to do so or an exemption under the Copyright Act 1968 (Cth). Unlike in the US, it is not a defence to have radically ‘transformed’ an existing copyrighted work into a new work. Australian fair dealing exceptions are considerably narrower than the US’ fair use defence and are limited to research and study, criticism and review, parody and satire, and news reporting. There are also exceptions for where a copyrighted work is used for the services of the Crown or a temporary reproduction is ‘incidentally made as a necessary part of a technical process of using’ a work, provided that such use does not itself constitute an infringement.

In the case of procuring copyrighted training data, it is obvious that the most popular method for quickly collecting large training datasets, web scraping, will necessarily involve some degree of copyright infringement. That is, unless licences can be sought from millions of websites and content creators to collate such datasets. This also extends to the use of existing open-source datasets such as The Pile and Common Crawl that have been produced by scraping websites without authorisation.

Scraping creates persistent copies of websites, typically including their underlying HTML, JavaScript and CSS, which are themselves copyrightable even if a website’s content is not. In effect, one is digitally reproducing copyrighted works. If those reproductions are unlicensed, they will constitute copyright infringement the moment they are created, regardless of how they have been transformed and regardless of whether they ever leave one’s computer.

Most exceptions under the Copyright Act are unlikely to apply to the procurement of copyrighted training data unless, of course, such data is being procured by a government for the services of the Crown.

The exemption for dealing in a copyrighted work for the purpose of research or study is not guaranteed and requires courts to balance several factors, including the effect the dealing would have on the potential market value of the work and the proportion of the work that has been reproduced. Where web scraping is concerned, more likely than not, it will be whole websites and not portions thereof that are collected. The risk that a copyrighted work is eventually encoded into a model’s weights and can therefore be reproduced at any time by users of that model could have a substantially detrimental effect on its market value.

The most compelling potential defence I have seen floated for web scraping copyrighted training data arises under Section 43B of the Copyright Act, which protects the incidental temporary reproduction of a copyrighted work made as a necessary part of a technical process of using the work. I could envision this section being relied on where temporary copies of copyrighted data held in RAM are streamed to an AI model in real time as the data is being scraped. One might also argue that saving web-scraped data to a disk and later deleting it after it has been used to train a model qualifies as temporarily reproducing such data, although that would likely be a much harder sell, as it effectively creates persistent copies of the data, even if they are never intended as such.
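To make the technical distinction concrete, here is a minimal Python sketch of the streaming approach: each scraped page is held in memory only long enough to be passed to a training step and is never written to disk. The names `stream_pages`, `train_on_stream`, `fetch` and `update` are hypothetical placeholders of my own, and the sketch illustrates only the engineering pattern; whether such a pipeline would actually attract the protection of Section 43B is a separate legal question.

```python
from typing import Callable, Iterable, Iterator


def stream_pages(urls: Iterable[str], fetch: Callable[[str], str]) -> Iterator[str]:
    """Yield each page's text one at a time, holding it only in memory."""
    for url in urls:
        # `fetch` is a hypothetical downloader supplied by the caller.
        yield fetch(url)


def train_on_stream(pages: Iterable[str], update: Callable[[str], None]) -> int:
    """Pass each page to a training step and return how many pages were seen."""
    seen = 0
    for page in pages:
        # `update` is a hypothetical training step; nothing is saved to disk,
        # and the page becomes unreachable once the loop moves on.
        update(page)
        seen += 1
    return seen


# Stand-in for a real website, so the sketch runs without network access.
fake_site = {
    "https://example.com/a": "first page here",
    "https://example.com/b": "second page",
}
word_counts: list = []
pages_seen = train_on_stream(
    stream_pages(fake_site, fake_site.__getitem__),
    lambda text: word_counts.append(len(text.split())),
)
```

The generator is the design choice that matters here: because pages are produced lazily, at most one page is held by the pipeline at a time, in contrast to the save-then-delete approach, which leaves a persistent copy on disk until it is removed.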

The primary obstacle to using Section 43B to lawfully stream copyrighted data as it is being scraped to a model is that the Section does not apply where the end use would itself constitute an infringement. Thus, for Section 43B to be a viable defence for web scraping copyrighted data, it must also be the case that it is not an infringement to train AI on such data, an issue covered in the next section of this article.

Suppose you’ve managed to obtain a licence to collect an enormous amount of copyrighted training data. Or perhaps you have, as mentioned above, built a system to automatically stream copyrighted training data to a model as it is being scraped. Or maybe you think you can make the argument that saving and later deleting copyrighted training data counts as temporarily reproducing that data. The question still remains: is it legal to train AI on such data if you do not have a licence to use it for training or a defence for doing so?

The answer is, yes and no. Yes, if none of the resulting model’s weights encode reproductions of unlicensed copyrighted works, however lossy those reproductions may be. No, in all other cases.

The risk of copyrighted data being encoded in a model’s weights is far from hypothetical. Many of the earlier mentioned copyright lawsuits now being defended by major generative AI providers centre on the claim that copyrighted data has been unlawfully reproduced in models’ weights. Given the sheer size of GPT-4 and the ubiquity of Shakespeare, it is no surprise, for instance, that GPT-4 can reproduce much of Romeo and Juliet verbatim. This also applies to plenty of other works that are still under copyright but are just as ubiquitous in popular culture as Shakespeare is (eg, Lord of the Rings, Game of Thrones, Star Wars).
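One rough way to check for this kind of regurgitation is to look for the longest passage that a model's output shares verbatim with a reference text. The following is a minimal sketch using Python's standard `difflib` module; the function name `longest_verbatim_overlap` is my own, and a character-level comparison like this is only a crude proxy, not a legal test of whether a ‘substantial part’ of a work has been reproduced.

```python
from difflib import SequenceMatcher


def longest_verbatim_overlap(output: str, reference: str) -> str:
    """Return the longest run of characters that appears verbatim in both texts."""
    # autojunk=False stops difflib from ignoring very common characters
    # when the texts are long.
    matcher = SequenceMatcher(None, output, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(output), 0, len(reference))
    return output[match.a : match.a + match.size]


# Public-domain example text; `model_output` stands in for a real model response.
reference_text = "but soft what light through yonder window breaks"
model_output = "the model wrote: what light through yonder window and stopped"
shared = longest_verbatim_overlap(model_output, reference_text)
```

In practice, one would run a check like this over many outputs and many reference works, and treat long shared passages as a signal for closer review.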

Although there is real merit in analogising the way LLMs lossily encode information in weights with how human brains encode information in neurons, Australian copyright law is, so long as neurobiological storage is excluded, agnostic as to how copyrighted data may be reproduced. An encoding of a Sylvia Plath poem in a model’s weights that can only be activated by a particular input is just as much a reproduction of the poem as republishing that same poem on your blog, which is just as much a reproduction of the poem as exporting it to a PDF, zipping that PDF, attaching that ZIP to an email, encrypting that email such that, depending on the decryption code used, it can decode to either Plath’s poem or a poem by Ted Hughes, and finally writing the hex code of the resulting franken-file to paper.

Giving more thought to the risk of copyrighted data being reproduced in model weights, it seems to apply even to those who have obtained licences to train on copyrighted datasets. Say, for example, that one was able to procure every work owned by Elsevier, Springer and all other major academic publishers. While it would be perfectly lawful to reproduce those collections as a whole, there may still be content contained within them that, if either isolated or stitched together, would constitute copyright infringement.

In the course of researching, reviewing, criticising or reporting on copyrighted works, academics and journalists will often quote select portions of those works to support their assertions. As insubstantial excerpts of works, they may not constitute infringement on their own, but when taken together they may be used to reconstruct the entirety of source works or substantial portions thereof, which would most certainly violate copyright.

The risk is that, in the process of constructing an internal model of the knowledge contained in corpora, AI models may have substantial portions of copyrighted works inadvertently encoded into their weights, even if none of those portions appear in complete form in their training data. Again, this risk is not merely hypothetical. A Google Scholar search for “Harry Potter” shows that there are roughly 131,000 indexed works containing that exact phrase, no doubt many of them quoting at least one passage from the book series. It is not particularly difficult to imagine that, after reading all those works, one would be able to reconstruct a substantial portion of at least one Harry Potter book. Nor is it difficult to believe that a model as large as GPT-4 could learn substantial portions of Harry Potter if fed those same academic works. Thus, it is apparent that even if you make best efforts to license training data, you may still have no choice in whether your model learns to reproduce unlicensed copyrighted works quoted in such data.
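The reconstruction risk can be illustrated with a toy example. The sketch below, written in Python with function names of my own choosing, greedily stitches together excerpts wherever the end of one overlaps the start of another. It is a deliberately simple stand-in, not a description of how a language model learns, and it uses a public-domain sentence as its source work.

```python
def overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def stitch(excerpts: list[str], min_overlap: int = 5) -> list[str]:
    """Greedily merge excerpts that overlap by at least `min_overlap` characters."""
    pieces = list(dict.fromkeys(excerpts))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(pieces):
            for j, right in enumerate(pieces):
                if i != j and overlap(left, right) > best_size:
                    best_size, best_i, best_j = overlap(left, right), i, j
        if best_i is None or best_size < min_overlap:
            return pieces  # nothing left that overlaps enough to merge
        merged = pieces[best_i] + pieces[best_j][best_size:]
        pieces = [p for k, p in enumerate(pieces) if k not in (best_i, best_j)]
        pieces.append(merged)


# Three short "quotations", each insubstantial on its own.
quotes = [
    "of a good fortune, must be in want of a wife.",
    "It is a truth universally acknowledged, that a single man",
    "that a single man in possession of a good fortune",
]
reconstructed = stitch(quotes)
```

Here the three excerpts merge into the complete opening sentence of Pride and Prejudice, even though no single excerpt contains it in full. The same logic, applied to thousands of overlapping quotations of a copyrighted work, is what makes the scenario above plausible.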

So, where does this leave us? Well, while Australian copyright law may not (directly) harm content creators, it certainly does hurt local AI practitioners, who are unable to freely compete with their international counterparts without fear of legal retribution. The only way to guarantee not infringing on copyright when training a foundational model is to obtain a licence for all training data and a licence for any content contained within such data that, even if lawfully reproduced within a dataset, could be inadvertently unlawfully reproduced in model weights.

Clearly, that is not practical when foundational models require ridiculous amounts of training data that is also diverse enough to be able to approximate all of human knowledge, however roughly.

In its current state, Australian copyright law is not fit for purpose. AI practitioners must be able to operate in an environment of legal certainty to feel secure in training their own foundational models. At the same time, it is absolutely essential that creators are adequately compensated for their work, without which none of the advancements made in AI in recent years would be possible.

The need for reform is urgent given just how rapidly advancements in technology are being made and just how desperate the largest generative AI players are to build their own ‘AI moats’.

Australian data scientists no doubt have the skills to compete with their international counterparts, so a lack of education is not the problem. The problem is a copyright regime lagging behind the rest of the rule-of-law world.

While finding a solution is beyond the scope of this article, I suggest policymakers act quickly to adopt a new model that will ensure Australian creators are properly compensated while also accelerating rather than stifling innovation, lest our burgeoning AI industry fall irreparably behind our international counterparts.

Thankfully, policymakers are already aware of this and are currently in the process of investigating how best to reform Australian copyright law. Last month, the Attorney-General announced the establishment of a new copyright and AI reference group, which will hopefully be able to break down the many obstacles to AI innovation posed by our copyright regime.

Disclaimer: The views expressed in this article are the author’s and do not necessarily reflect those of their employer. This article does not, and is not intended to, constitute legal advice.
