ByteDance, the parent company of TikTok, is stepping up its efforts to train generative AI models with the launch of a new web-scraping tool. Dubbed Bytespider, the bot was reportedly introduced in April and has already become one of the most aggressive web scrapers in operation.
Research from bot management company Kasada and bot monitoring firm Dark Visitors revealed that ByteDance’s Bytespider scrapes web data 25 times faster than GPTBot, OpenAI’s web scraper for its ChatGPT platform. It is also scraping 3,000 times faster than ClaudeBot, the scraper Anthropic uses for its Claude platform.
A scraping frenzy
Since its debut, Bytespider’s activity has only increased, with noticeable spikes in scraping over the past six weeks, according to a Fortune report.
It appears ByteDance is trying to quickly gather as much data as possible to catch up with other tech giants like Google, Meta, and OpenAI. All of these companies use web scrapers to collect vast amounts of online data to train their large language and multimodal models (LLMs or LMMs).
However, ByteDance’s scraper, like those used by other AI companies, does not adhere to robots.txt, the file site owners publish to tell crawlers which parts of a website they should not access.
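For readers unfamiliar with how robots.txt works, the following is a minimal sketch in Python of the check a compliant crawler would run before fetching a page. It uses the standard library's `urllib.robotparser`. The sample rules, the `example.com` URLs, and the `is_allowed` helper are illustrative assumptions, not taken from any real site or from ByteDance's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site owner might publish.
# These rules are illustrative only.
SAMPLE_ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("Bytespider", "https://example.com/articles/1"),
        ("GPTBot", "https://example.com/articles/1"),
        ("GPTBot", "https://example.com/private/report"),
    ]:
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent} -> {url}: {verdict}")
```

The file only expresses the site owner's wishes. Nothing in the protocol stops a crawler from skipping this check, which is why compliance depends entirely on the scraper's operator.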
Though robots.txt isn’t legally enforceable, ignoring it has stirred controversy. Web scraping is often seen as infringing on copyright, particularly when used to train AI models.
As generative AI tools rely heavily on web data to function, scraping has become a contentious issue, with many individuals and organizations arguing that their work is being copied without compensation. The practice has been around for decades, primarily for search engines, but the rise of AI has introduced new legal and ethical concerns.
ByteDance’s AI push
ByteDance’s aggressive scraping efforts come as the company faces scrutiny, particularly in the US. President Joe Biden has signed legislation requiring ByteDance to either sell TikTok or shut it down, citing national security concerns.
Despite this, ByteDance seems determined to advance its AI capabilities.
ByteDance’s scraping frenzy suggests the company is working on a new large language model. Reports from earlier this year indicate that ByteDance was behind in the generative AI race and even relied on OpenAI’s technology to help build its own model, a move that violated OpenAI’s terms of service.
In early 2023, ByteDance launched Doubao, a chat-based LLM, but the model’s development was completed before the more recent data collection efforts.
One potential application for ByteDance’s new LLM is improving TikTok’s search functionality. TikTok recently updated its search feature to focus on keywords for ads, allowing advertisers to target trending words in real time. With a more robust AI model trained on up-to-date web data, TikTok could further enhance its search capabilities, creating a more competitive environment for advertisers relying on Google.
The rapid data collection and AI advancements suggest that ByteDance is eager not only to catch up but also to reshape the landscape of search and AI, especially given TikTok’s massive user base. These efforts could make TikTok’s search environment highly appealing to advertisers looking to reach larger audiences through precise, data-driven keywords and trends.