A data wall of their own making
Training a new generative AI model requires an enormous amount of data. But as The New York Times' Kevin Roose explains, "Gathering new data has gotten trickier" in recent months "as publishers and online platforms have taken steps to prevent their data from being harvested." Roose cites a new study from the Data Provenance Initiative that confirms an "emerging crisis in consent." Analyzing three popular data sets (C4, RefinedWeb, and Dolma), the researchers found that in just the past year, "5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted" and "as much as 45 percent of the data in one set, C4, had been restricted by websites' terms of service" (per Roose). And AI companies are concerned, Roose reports:
Some A.I. executives I've spoken to worry about hitting the "data wall"—their term for the point at which all of the training data on the public internet has been exhausted, and the rest has been hidden behind paywalls, blocked by robots.txt or locked up in exclusive deals.
As I wrote in "How to Fix 'AI's Original Sin'" (which we shared last week), it doesn't have to be this way, if AI companies would only put the lessons we learned from the web and YouTube into practice. Web-crawling search engines like Google made a bargain: we'll read all your content and use it to build a search index, but that will be good for you because we'll help people find your content, and we'll help you monetize it. YouTube gave music companies asking for a takedown of any user video that contained copyrighted music a better alternative: let us monetize it for you and share the revenue. At O'Reilly, we ground all our AI derivatives in content from our authors, subject matter experts, and partner publishers, and tie it directly into our payment system, which allocates a share of our subscription revenue to our content providers based on usage.
I've had conversations with OpenAI and other AI companies since 2022 about the urgent need for an economic model by which they reward creators for participating in the AI ecosystem. But they've chosen instead to take without figuring out how to give back. The fact that more and more content is being closed off from use in AI training is a direct result of the content land grab. As the Chinese philosopher Lao Tzu once wrote, “Fail to honor people, they fail to honor you.”
It's not too late to build a creator- and copyright-aware AI ecosystem that allows training on copyrighted material because it provides fair recompense for its use—not with one-time licensing fees ("selling your house for firewood") but as part of a sustainable business partnership that allocates value to those who help create it. I made a few suggestions in my article referenced above, but a world of opportunity awaits once entrepreneurs start seeing the possibilities in a copyright-aware AI ecosystem.
+ From AI Snake Oil: "AI Scaling Myths"
+ From The New York Times: "How Tech Giants Cut Corners to Harvest Data for A.I."
+ From Proof News and WIRED: "Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI."
+ From The Verge: "Biden's Top Tech Adviser Says AI Is a 'Today Problem'."
+ ICYMI: Ilan Strauss and I are leading the SSRC's AI Disclosures Project. You can check it out here. And please follow our newsletter and social media accounts if you're interested.