Worrying times for AI ahead? Major tech companies are running out of data to train LLMs


In the rapidly evolving AI economy, data is the linchpin of progress. It is not merely one component among many; it is the lifeblood of AI models, shaping both what they can do and how well they do it.


The correlation is clear: the more abundant and diverse the human-generated data an AI system is exposed to, the more adept it becomes.

However, AI companies face a troubling constraint: natural, human-generated data is finite. In a warning that has circulated among AI researchers for nearly a year, experts caution that the supply of raw data needed to train AI systems is running dry.

Rita Matulionyte, a professor of information technology law at Macquarie University in Australia, emphasizes this concern in an essay for The Conversation.

A study by the AI forecasting organization Epoch AI puts a timeline on this scenario. The study estimates that AI companies could face a shortage of high-quality textual training data as early as 2026, with low-quality text and image data potentially running out between 2030 and 2060.

This scarcity poses a substantial threat to AI firms that rely on a continuous influx of data to improve their models. Progress in AI has so far tracked the growth in the volume of training data. If that supply stagnates, the consequences could reverberate throughout the industry.

Matulionyte suggests a potential remedy in the form of synthetic data generated by AI models. However, the viability of this solution is contested, with research indicating a risk of an “inbreeding effect” that distorts the model when trained on AI-generated content. Despite these challenges, some companies are already exploring synthetic training sets.

A pragmatic alternative emerges in the concept of data partnerships. Companies or institutions possessing vast repositories of high-quality data could enter into agreements with AI companies to share this data, often in exchange for financial compensation.

OpenAI, a prominent Silicon Valley AI firm, recently launched a Data Partnership initiative. In a blog post, the company underscores the significance of such collaborations in steering the future of AI and creating more relevant models for diverse organizations.

As the race for data intensifies, the practicality of data partnerships becomes a focal point. Many AI datasets are currently built from internet-scraped data created by online users, which makes data partnerships a plausible additional source. Yet as the value of data rises, competition for datasets is set to intensify, raising questions about how willing institutions and individuals will be to share their data with AI companies.

Even with data partnerships, uncertainty remains about the sustainability of the data supply. Despite the seemingly boundless expanse of the internet, the prospect of dwindling data reserves forces a reassessment of the assumption that this critical resource is endless.
