There are two things to know about artificial intelligence.
The first is that it can “learn” from structured datasets. The second is that it needs massive resources, including human labor for tasks like data tagging and labeling, to train its models so they can perform effectively.
These two elements came to a head before the start of the New Year as AI pioneer OpenAI and its Big Tech partner Microsoft were sued Dec. 27 by The New York Times (NYT) for training their models on copyright-protected and paywalled NYT content without the publisher’s permission or compensation.
Buried in the complaint was the fact that the news publisher had been working to finalize a licensing deal with OpenAI, but the two sides were unable to reach an agreement.
Some publishers, including the Associated Press and Axel Springer, have already reached commercial agreements to license their content to OpenAI.
Now, a Thursday (Jan. 4) report by The Information reveals just how much they have gotten from those deals — somewhere between $1 million and $5 million. The NYT, observers must assume, was probably hoping for more.
That amount could be considered less than overwhelming for the tech sector’s $90 billion darling to shell out, particularly when compared to the $50 million Apple is reportedly shopping around to other publishers to train its own AI systems, among them Condé Nast, the publisher of Vogue and The New Yorker, and IAC, which owns People, The Daily Beast and Better Homes and Gardens.
But perhaps most importantly, in the run-up to the NYT’s legal complaint going public, OpenAI had started fielding complaints that its flagship ChatGPT product was getting lazier.
we’ve heard all your feedback about GPT4 getting lazier! we haven’t updated the model since Nov 11th, and this certainly isn’t intentional. model behavior can be unpredictable, and we’re looking into fixing it
— ChatGPT (@ChatGPTapp) December 8, 2023
A potential culprit? OpenAI could no longer rely on NYT-based language datasets to train its models, which may have resulted in a decline in capability.
OpenAI’s GPT crawler has been blocked from accessing data by Vox Media as well.
While OpenAI did not immediately reply to PYMNTS’ request for comment, the fact that today’s commercially available models can dip and rise in performance is an important data point for enterprise businesses weighing whether to integrate the innovation into their workflows.
It represents a potentially crucial point of failure that could derail business-critical processes and have unforeseen consequences across tech stacks.
That isn’t to say that AI is too risky to onboard, but rather that firms need to be aware of the AI systems they are turning to, and most importantly the data sources behind them.
See also: Walled Garden LLMs Build Enterprise Trust in AI
The foundational large language models (LLMs) commercialized today are typically built with deep learning algorithms: neural networks trained on billions of words of ordinary language.
The specific sources of data used for training are often undisclosed by the companies behind these models. However, much of the data comes from publicly available information on the web that has been scraped and analyzed by LLMs, whose technical architecture was not designed to verify inputs or attribute them to system outputs.
In light of the use of copyrighted materials for training data, several news media publishers have reportedly met with OpenAI to discuss licensing their content for use in training that firm’s AI models.
The publishers meeting with OpenAI included The Wall Street Journal owner News Corp., Dotdash Meredith owner IAC, USA Today owner Gannett and industry trade association News/Media Alliance.
“I think [the lawsuit is] going to put a shot across the bow of all platforms on how they’ve trained their data, but also on how they flag data that comes out and package data in such a way that they can compensate the organizations behind the training data,” Shaunt Sarkissian, founder and CEO at AI-ID, an AI tracking, authentication, source validation and output data management/control platform, told PYMNTS in December. “The era of the free ride is over.”
The U.S. Copyright Office has launched an initiative to study the use of copyrighted materials in AI training, indicating that legislative or regulatory steps may be necessary in the near term to address the use of copyrighted materials within AI model training datasets.
The underlying issue — and the potential reason that OpenAI, a startup, is offering less than Apple, the world’s most valuable company, to publishers — is that the cost of developing generative AI software is already sky-high.
Forcing AI companies to pay a market price for all the data they scrape online would more likely than not push many of them into ruin.