Research:Data

There is a great deal of publicly available, open-licensed data about Wikimedia projects. This page is intended to help community members, developers, and researchers who are interested in analyzing raw data learn what data and infrastructure are available.

If you have any questions, you might find the answer in the Frequently Asked Questions about Data. If you still have questions, you can email your question to the Analytics mailing list (more information).

If you wish to browse pre-computed metrics and dashboards, see statistics.

If this publicly available data isn't sufficient, you can look at the page on private data access to see what non-public data exists and how you can gain access.

Quick glance

By access method

Data Dumps (details)

Homepage – Download

Dumps of all WMF projects for backup, offline use, research, etc.

Wiki content, revisions, metadata, and page-to-page and outside links
XML and SQL format
once/twice a month
large file sizes
The dumps.wikimedia.org domain also hosts other data

APIs (details)

The MediaWiki API provides direct, high-level access to the data contained in MediaWiki databases over the web.
- Meta info about the wiki and logged-in user, properties of pages (revisions, content, etc.) and lists of pages based on criteria
- JSON, XML, and PHP's native serialization format

The Wikimedia REST API provides page content in various formats.

The Wikimedia Analytics API provides pageviews and aggregate edit stats.

Wiki Replicas (details)

Data Services allows Wikimedia Cloud Services users to query a sanitized copy of the Wikimedia MediaWiki databases.

Toolforge and Cloud VPS hosting environments include access to the Wiki Replicas.
PAWS is a Jupyter Notebook environment that allows, for example, querying the Wiki Replicas and APIs for analysis.
Quarry and Superset are public web interfaces for SQL queries to the Wiki Replicas.

Recent changes stream (details)

Homepage

Wikimedia broadcasts every change to every Wikimedia wiki using Server Sent Events over HTTP.

Analytics Dumps (details)

Homepage

Raw pageviews, unique device estimates, mediacounts, etc.

Delimited, usually: Project, (Page title,) Count
Aggregated hourly or daily
pageviews complete – mediacounts – unique devices

WikiStats (details)

Homepage

Reports based on data dumps and server log files.

Unique visits, page views, active editors and more
Intermediate CSV files available
Graphical presentation

DBpedia (details)

DBpedia extracts structured data from Wikipedia. It allows users to run complex queries and link Wikipedia data to other data sets.

RDF, N-Triples, SPARQL endpoint, Linked Data
Billions of triples of info in a consistent ontology

DataHub and Figshare (details)

DataHub Homepage

A collection of various Wikimedia-related datasets.

Smaller (usually one-time) surveys/studies
DBpedia-Live and others
Figshare (datasets tagged 'wikipedia')

Differential privacy (details)

Differential privacy homepage

A collection of differentially-private datasets, released daily, weekly, or monthly.

pageview data
editor/edit data
centralnotice data
search data

By data domain

The table below is a quick reference of data sources organized by data domain. For a more detailed overview of Wikimedia data domains and how to access data in each domain, use the links in the table or see Research:Data introduction.

Data domain	Data source	Access method
Content	MediaWiki REST API	API
Content	MediaWiki Action API:Parse (HTML)	API
Content	MediaWiki Action API:Revisions (wikitext)	API
Content	Wikidata:REST_API	API
Content	Wikimedia Enterprise APIs (require separate accounts, free access may have limits)	API
Content – structured data	Wikidata:REST_API	API
Content – structured data	Wikidata SPARQL query service	API
Content – structured data	Commons SPARQL query service	API
Content – structured data	DBpedia SPARQL endpoint	API
Contributions / edits	MediaWiki Action API: Revisions	API
Contributions / edits	MediaWiki Action API: Allrevisions	API
Contributions / edits	Wikimedia Analytics API: Edits data	API
Contributions / edits	MediaWiki Event Streams	API
Contributions / edits	Wikimedia Enterprise APIs (require separate accounts, free access may have limits)	API
Contributors / editors	Wikimedia Analytics API: Editors by country	API
Contributors / editors	MediaWiki Action API: Users	API
Contributors / editors	MediaWiki Action API: Usercontribs	API
Traffic	Wikimedia Analytics API: Pageviews	API
Traffic	Wikimedia Analytics API: Unique devices	API
Traffic	Wikimedia Analytics API: Mediarequests	API
Contributions / edits	Wikistats	Dashboard
Contributions / edits	XTools	Dashboard
Contributions / edits	Bitergia: technical community metrics	Dashboard
Contributors / editors	Wikistats	Dashboard
Contributors / editors	XTools	Dashboard
Contributors / editors	Bitergia: technical community metrics	Dashboard
Traffic	Devices	Dashboard
Traffic	Wikistats	Dashboard
Traffic	Readers:Pageviews and Unique Devices	Dashboard
Traffic	Pageviews Tool	Dashboard
Traffic	WikiNav	Dashboard
Content	Wikitext	Download
Content	Static HTML and Enterprise HTML (use mwparserfromhtml)	Download
Content	Knowledge gaps	Download
Content – structured data	Commons image depicts	Download
Content – structured data	Wikidata dumps (JSON, RDF, XML)	Download
Content – structured data	DBpedia.org	Download
Contributions / edits	Mediawiki_history	Download
Contributions / edits	geoeditors	Download
Contributions / edits	Differential privacy: Geoeditors	Download
Traffic	Clickstream	Download
Traffic	Pageview hourly	Download
Traffic	Unique devices	Download
Traffic	Mediacounts	Download
Traffic	Differential privacy pageviews	Download
Content	Text	MediaWiki database tables
Contributions / edits	Revision_table	MediaWiki database tables
Contributors / editors	Mediawiki_history	MediaWiki database tables
Contributors / editors	geoeditors	MediaWiki database tables
Contributors / editors	Differential privacy: Geoeditors	MediaWiki database tables
Contributors / editors	actor	MediaWiki database tables
Contributors / editors	user	MediaWiki database tables
Contributors / editors	user_groups	MediaWiki database tables
Contributors / editors	user_former_groups	MediaWiki database tables
Contributors / editors	user_properties	MediaWiki database tables
Contributors / editors	globaluser	MediaWiki database tables

Data dumps

WMF releases data dumps of Wikipedia, Wikidata, and all WMF projects on a regular basis, as well as dumps of other Wikimedia-related data such as search indices and short URL mappings.

Content

XML/SQL dumps

Text of current and/or all revisions of all pages, in XML format (schema)
Metadata for current and/or all revisions of all pages, in XML format (schema)
Most database tables as SQL files
- Page-to-page link lists (pagelinks, categorylinks, imagelinks, templatelinks tables)
- Lists of pages with links outside of the project (externallinks, iwlinks, langlinks tables)
- Media metadata (image, oldimage tables)
- Info about each page (page, page_props, page_restrictions tables)
- Titles of all pages in the main namespace, i.e. all articles (*-all-titles-in-ns0.gz)
- List of all pages that are redirects and their targets (redirect table)
- Log data, including blocks, protection, deletion, uploads (logging table)
- Misc bits (interwiki, site_stats, user_groups tables)
Stub-prefixed dumps for some projects which only have header info for pages and revisions without actual content

See a more comprehensive list of what is available for download.

Other dumps

Dumps.wikimedia.org offers various other database dumps and datasets, including

Adds/changes dumps (includes no moves or deletes, plus some other limitations) (documentation)
Wikidata entity dumps – see Wikidata:Data access for more information
Various analytics datasets (described below)

Download

You can download dumps from the last year (dumps.wikimedia.org/enwiki/ for English Wikipedia, dumps.wikimedia.org/dewiki/ for German Wikipedia, etc.). Download mirrors offer an alternative to the download page.

Due to large file sizes, using a download tool is recommended.

There are also archives. Many older dumps can also be found at the Internet Archive.

Data format

XML dumps are in the wrapper format described at Export format (schema). Files are compressed in gzip (.gz), bzip2/lbzip2 (.bz2) and .7z formats.

SQL dumps are provided as dumps of entire tables, using mysqldump.

Some older dumps exist in various formats.

How to and examples

See examples of importing dumps in a MySQL database with step-by-step instructions.
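
The compressed XML dumps can also be read as a stream, without importing them into a database first. The following is a minimal Python sketch rather than an official tool. The names `local_name`, `iter_pages`, and `SAMPLE` are invented for this illustration, the sample document is a hand-written fragment, and the namespace version in it is an assumption (the code ignores the namespace so that other schema versions should work as well).

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element in an export-format stream.

    If a page holds several revisions, the text of the last one read is kept.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the page's children to limit memory use


# Hand-written sample standing in for a real dump file (assumption).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page>
    <title>Example</title>
    <revision><id>1</id><text>Hello world</text></revision>
  </page>
</mediawiki>"""

# A real .bz2 dump would be opened with bz2.open(path, "rb") instead.
with bz2.open(io.BytesIO(bz2.compress(SAMPLE)), "rb") as stream:
    for page_title, page_text in iter_pages(stream):
        print(page_title, len(page_text))
```

Streaming with `iterparse` avoids loading the whole file into memory, which matters given the file sizes mentioned above.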

Existing tools

Some tools are listed on the following pages, but these tools are mostly outdated and non-functional:

License

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages.

Support

Mailing list: xmldatadumps-l
Bug reports: Dumps Generation project in Phabricator
Design work on Dumps 2.0 replacement: Dumps Rewrite project in Phabricator

MediaWiki API

The MediaWiki API provides direct, high-level access to the data contained in MediaWiki databases. Client programs can log in to a wiki, get data, and post changes automatically by making HTTP requests.

Content

Meta information about the wiki and the logged-in user
Properties of pages, including page revisions and content, external links, categories, templates, etc.
Lists of pages that match certain criteria
See the full list of available information
- See also additional information for Wikidata and Wikidata's SPARQL query endpoint

Endpoint

To query the database, send an HTTP GET request to the desired endpoint (for example, https://en.wikipedia.org/w/api.php for English Wikipedia), setting the action parameter to query and defining the query details in the URL.

How to and examples

API Tutorial
Example: https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles=Main%20Page fetches (action=query) the content (rvprop=content) of the most recent revision of Main Page (titles=Main%20Page) of English Wikipedia (https://en.wikipedia.org/w/api.php?) in XML format (format=xml). You can paste the URL in a browser to see the output.
More examples
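
The example URL above can also be assembled programmatically. The sketch below is one possible way to do this with the Python standard library; `build_query_url` and `fetch_json` are hypothetical helper names, and the User-Agent string is a placeholder you would replace with your own contact details, in line with the API etiquette. Calling `fetch_json` needs network access, so the sketch does not call it.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_query_url(title, fmt="json"):
    """Build a URL requesting the content of the latest revision of a page."""
    params = {
        "action": "query",    # fetch data rather than modify it
        "prop": "revisions",  # ask for revision properties
        "rvprop": "content",  # specifically, the revision content
        "format": fmt,        # output format, e.g. "json" or "xml"
        "titles": title,
    }
    # quote_via=quote encodes spaces as %20, matching the example URL.
    return API_ENDPOINT + "?" + urlencode(params, quote_via=quote)


def fetch_json(url, user_agent):
    """Send the GET request and decode a JSON response (needs network access)."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


print(build_query_url("Main Page", fmt="xml"))
```

With `fmt="xml"` the helper reproduces the example URL; a script would more typically pass `fmt="json"` and hand the result to `fetch_json`.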

Existing tools

To try out the API interactively on English Wikipedia, use the API Sandbox.

Access

To use the API, your application or client might need to log in.

Before you start, learn about the API etiquette.

Researchers may be granted special access rights on a case-by-case basis.

License

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL).

Support

Frequently asked questions (FAQ)
mediawiki-api mailing list

Wiki Replicas

The Wiki Replicas (part of WMCS wikitech:Portal:Data Services) host sanitized versions of Wikimedia production MediaWiki databases.

Content

Users of various Wikimedia Cloud Services products can access the Wiki Replicas databases, which host sanitized copies of the databases of all Wikimedia projects, including Commons.

Data format

Explore the database schema of the MediaWiki software.

How to

See the Wiki Replicas page on Wikitech on how to access the Wiki Replicas.

Support

See wikitech:Help:Cloud Services introduction#Communication and support

Recent changes stream

See EventStreams to subscribe to Recent changes on all Wikimedia wikis. This broadcasts edits and other changes as they happen.
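
Because the stream uses Server-Sent Events, a client mostly needs to collect `data:` lines until a blank line ends the event. The sketch below shows that parsing step only, on hand-written sample lines; `iter_sse_events` is a name invented here, and the field names and values in the sample are illustrative assumptions, so consult the EventStreams documentation for the actual stream URL and event schema.

```python
import json


def iter_sse_events(lines):
    """Yield the decoded JSON payload of each Server-Sent Event in `lines`."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            data_parts.append(value)
        # Other fields such as "event:" and "id:" are ignored in this sketch.


# Hand-written lines imitating a stream; the values are made up.
SAMPLE_LINES = [
    ":ok\n",
    "event: message\n",
    'data: {"wiki": "enwiki", "title": "Example", "type": "edit"}\n',
    "\n",
]

for change in iter_sse_events(SAMPLE_LINES):
    print(change["wiki"], change["type"], change["title"])
```

In practice the `lines` argument would be the decoded lines of a long-lived HTTP response, and a dedicated SSE client library may be a more robust choice than hand-rolled parsing.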

Existing tools

See wikitech:Event Platform/EventStreams/Powered By

Analytics Datasets

Analytics Datasets on dumps.wikimedia.org offers stable and continuous datasets about web request statistics (including page views, mediacounts, unique devices), page revision history, data by country, and Wikidata QRanks.

Pageview statistics

Pageview statistics are one example. Each request of a page reaches one of Wikimedia's Varnish caching hosts. The project name and the title of the page requested are logged and aggregated hourly.

Files starting with "project" contain total hits per project per hour statistics.

Per-country pageviews data is also available, sanitized for privacy reasons. See this announcement post (June 2023).

See the README for details on the format.
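
As a rough illustration of working with such files, the sketch below parses a few lines and totals views per project. It assumes a space-separated layout of project code, page title, view count, and a trailing size field; that layout is an assumption of this example, and the README remains the authoritative description. `PageviewRow`, `parse_pageview_line`, and `views_per_project` are names invented here, and the sample lines are made up.

```python
from collections import Counter
from typing import NamedTuple


class PageviewRow(NamedTuple):
    project: str
    title: str
    views: int


def parse_pageview_line(line):
    """Parse one line, assuming 'project title views size' separated by spaces."""
    fields = line.rstrip("\n").split(" ")
    return PageviewRow(project=fields[0], title=fields[1], views=int(fields[2]))


def views_per_project(rows):
    """Sum the view counts of all rows belonging to the same project."""
    totals = Counter()
    for row in rows:
        totals[row.project] += row.views
    return totals


# Made-up sample lines in the assumed layout.
SAMPLE_LINES = [
    "en Main_Page 42 0\n",
    "de Wikipedia:Hauptseite 7 0\n",
    "en Example 3 0\n",
]

rows = [parse_pageview_line(line) for line in SAMPLE_LINES]
print(views_per_project(rows))
```

Real files are gzip-compressed, so a script would typically read them with `gzip.open(path, "rt", encoding="utf-8")` and pass each line to the parser.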

You can interactively browse the page view statistics at https://pageviews.toolforge.org. More documentation on the Pageviews Analysis tool is available.

Clickstream data

The Wikipedia clickstream dataset contains counts of (referrer, resource) pairs extracted from the request logs of Wikipedia.
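
To make the (referrer, resource) structure concrete, the sketch below ranks the referrers of one page. It assumes tab-separated rows with four columns (referrer, resource, link type, count); that column layout is an assumption of this example, so check the dataset documentation before relying on it. The function names and the sample rows are invented for illustration.

```python
import csv
import io


def read_clickstream(stream):
    """Yield (referrer, resource, link_type, count) tuples from a TSV stream."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for referrer, resource, link_type, count in reader:
        yield referrer, resource, link_type, int(count)


def top_referrers(rows, resource, limit=10):
    """Return the (referrer, count) pairs for `resource`, largest count first."""
    matches = [(ref, n) for ref, res, _kind, n in rows if res == resource]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)[:limit]


# Made-up sample rows in the assumed layout.
SAMPLE_TSV = (
    "other-search\tExample\texternal\t100\n"
    "Main_Page\tExample\tlink\t25\n"
    "Main_Page\tAnother_page\tlink\t5\n"
)

rows = list(read_clickstream(io.StringIO(SAMPLE_TSV)))
print(top_referrers(rows, "Example"))
```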

Geoeditors

The public "Geoeditors" dataset contains information about the monthly number of active editors from a particular country on a particular Wikipedia language edition (bucketed and redacted for privacy reasons). For some earlier years, similar data is available at [1]/[2], see also Edits by project and country of origin.

Misc datasets

Additional datasets (mostly irregular or discontinued ones) are published at https://analytics.wikimedia.org/datasets/. These include Caching research data, and AS Performance Report.

WikiStats

Wikistats is an informal but widely recognized name for a set of reports which provide monthly trend information for all Wikimedia projects and wikis.

Content

Wikistats offers many dashboards that display trends about reading, contributing, and content, broken down by project, such as:

unique visitors
page views (overall and mobile only)
editor activity
article count

Data format

Data is presented as charts with the option to download the underlying data.

Support

For more details on Wikistats, see wikitech:Data Platform/Systems/Wikistats 2.

DBpedia

DBpedia.org is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data.

Content

The English version of the DBpedia knowledge base describes millions of things, and the majority of items are classified in a consistent ontology (persons, places, creative works like music albums, films and video games, organizations like companies and educational institutions, species, diseases, etc.). Localized versions of DBpedia in more than a hundred languages describe millions of things.

The data set also features:

about 2 billion pieces of information (RDF triples)
labels and abstracts for >10 million unique things in up to 111 different languages
millions of links to images, links to external web pages, data links into external RDF datasets, links to Wikipedia categories, YAGO categories
https://www.dbpedia.org/resources/ has download links for all the data sets, different formats and languages.

Data format

RDF/XML
Turtle
N-Triples
SPARQL endpoint

Access

https://dbpedia.org/sparql is DBpedia's SPARQL endpoint.
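
A query can be sent to this endpoint as an HTTP GET request with the SPARQL text in a `query` parameter. The sketch below only builds such a request and shows how a response in the standard SPARQL JSON results format could be read; it does not contact the endpoint. The helper names, the example query, and the sample response are illustrative assumptions rather than anything taken from the DBpedia documentation.

```python
from urllib.parse import urlencode
from urllib.request import Request

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: English label of one resource (assumed to exist).
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Wikipedia> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
"""


def build_sparql_request(query):
    """Build (but do not send) a GET request asking for JSON results."""
    url = SPARQL_ENDPOINT + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def binding_values(result, variable):
    """Extract the values bound to `variable` from a SPARQL JSON result."""
    return [
        binding[variable]["value"]
        for binding in result["results"]["bindings"]
        if variable in binding
    ]


# Hand-written response in the SPARQL JSON results format (made-up data).
SAMPLE_RESULT = {
    "head": {"vars": ["label"]},
    "results": {
        "bindings": [
            {"label": {"type": "literal", "xml:lang": "en", "value": "Wikipedia"}}
        ]
    },
}

request = build_sparql_request(QUERY)
print(request.full_url[:60])
print(binding_values(SAMPLE_RESULT, "label"))
```

To run the query for real, the request could be passed to `urllib.request.urlopen` and the body decoded with `json.load`.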

License

DBpedia data from version 3.4 on is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 License and the GNU Free Documentation License.

Support

Mailing list: DBpedia Discuss
Forum: https://forum.dbpedia.org/
DBpedia related publications, blog posts and projects

DataHub

The Wikimedia organization on the Open Knowledge Foundation's DataHub was established by the Wikimedia Foundation around 2013, and contains a collection of datasets about Wikipedia and other projects which mostly date from around 2013-2016.

Wikivoyage also maintains data on its own DataHub:

Hotels/restaurants/attractions data as CSV/OSM/OBF
Tourism guide for offline use

Differential privacy

The WMF privacy engineering team uses differential privacy to release data that would otherwise be too sensitive to release. This data currently only includes pageview statistics; in the future, it will include statistics about editors, centralnotice impressions and views, search, and more.

Content

Pageview data (currently only available as daily TSVs)
- 6 February 2023 – present: README / raw data
- 9 February 2017 – 5 February 2023: README / raw data
- 1 July 2015 – 8 February 2017: README / raw data

Data format

Differentially-private data is currently available in static TSV form at https://analytics.wikimedia.org/published/datasets/. Work to make this data available via API is ongoing.

License

Differentially-private data and code are available under a Creative Commons Zero license.

Support

Differential privacy at WMF homepage
DP experimentation repo
DP production repos:
- Production DP aggregations
- Production DP automation