Wikipedia is giving AI developers its data to fend off bot scrapers
Data science platform Kaggle is hosting a Wikipedia dataset that’s specifically optimized for machine learning applications.

Wikipedia is attempting to dissuade artificial intelligence developers from scraping the platform by releasing a dataset that’s specifically optimized for training AI models. The Wikimedia Foundation announced on Wednesday that it had partnered with Kaggle, a Google-owned data science community platform that hosts machine learning data, to publish a beta dataset of “structured Wikipedia content in English and French.”

Wikimedia says the dataset hosted by Kaggle has been “designed with machine learning workflows in mind,” making it easier for AI developers to access machine-readable article data for modeling, fine-tuning, benchmarking, alignment, and analysis. The content within the dataset is openly licensed and, as of April 15th, includes research summaries, short descriptions, image links, infobox data, and article sections, but not references or non-written elements like audio files.
The “well-structured JSON representations of Wikipedia content” available to Kaggle users should be a more attractive alternative to “scraping or parsing raw article text,” according to Wikimedia. That scraping is currently putting strain on Wikipedia’s servers as automated AI bots relentlessly consume the platform’s bandwidth. Wikimedia already has content sharing agreements in place with Google and the Internet Archive, but the Kaggle partnership should make that data more accessible for smaller companies and independent data scientists.
“As the place the machine learning community comes for tools and tests, Kaggle is extremely excited to be the host for the Wikimedia Foundation’s data,” said Kaggle partnerships lead Brenda Flynn. “Kaggle is excited to play a role in keeping this data accessible, available, and useful.”
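To illustrate why structured JSON can be easier to work with than scraped pages, here is a minimal Python sketch. It assumes the records arrive as JSON Lines (one article per line). The field names (`name`, `description`, `abstract`, `sections`) and the sample records are hypothetical stand-ins, not the dataset's published schema.

```python
import json

# Hypothetical records: the field names and values below are illustrative
# assumptions, not the published schema of the Kaggle dataset.
SAMPLE_JSONL = """\
{"name": "Example A", "description": "short description A", "abstract": "Summary of A.", "sections": [{"title": "History", "text": "Section text A."}]}
{"name": "Example B", "description": "short description B", "abstract": "Summary of B.", "sections": []}
"""


def iter_articles(jsonl_text):
    """Yield one dict per non-empty line of JSON Lines text."""
    for line in jsonl_text.splitlines():
        if line.strip():
            yield json.loads(line)


def to_training_text(article):
    """Join an article's summary and section text into plain text."""
    parts = [article.get("abstract", "")]
    parts += [section.get("text", "") for section in article.get("sections", [])]
    return "\n".join(part for part in parts if part)


articles = list(iter_articles(SAMPLE_JSONL))
texts = {article["name"]: to_training_text(article) for article in articles}
```

Because each field is already separated, no HTML parsing or markup stripping is needed before the text can be fed into a modeling or analysis pipeline.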