Dataset Card for Wikimedia Structured Wikipedia
Dataset Description
- Homepage: https://enterprise.wikimedia.com/
- Point of Contact: Stephanie Delbecque
- Total size: 44.42 GiB
Quick Links
- Wikimedia Enterprise
- Structured Contents Documentation
- Data Dictionary
- Wikimedia Attribution Framework
- Meta-Wiki Discussion
Dataset Summary
Pre-parsed English and French Wikipedia articles, extracted using the Wikimedia Enterprise Snapshot API.
This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and output as structured data with a consistent schema. The dataset is provided in Parquet format, optimized for high-performance analytical queries and efficient storage. This version uses a unified, pinned schema across all files, making it compatible with DuckDB, pandas, Polars, and Apache Spark out of the box.
New in this dataset:
- Parsed references and citations, connecting Wikipedia's knowledge with its sources of truth.
- Parsed tables, one of the most information-heavy sections of Wikipedia pages.
- Credibility signals, for example referenceneed and referencerisk, signaling where information may not be sufficiently backed by sources.
- Improvements to list parsing, including nested lists, ordered lists, and definition lists.
- Article-body images are parsed and available in the article body section payload.
Invitation for Feedback
The dataset is built as part of the Structured Contents initiative and based on Wikimedia Enterprise HTML snapshots. This updated version includes pre-parsed Wikipedia abstracts, short descriptions, main image, infoboxes, article sections, tables, references, citations, lists, article images, and more.
For updates, follow the Wikimedia Enterprise blog and MediaWiki Quarterly software updates.
We welcome feedback to help refine and improve this dataset. Please share suggestions or issues on the Wikimedia Enterprise Meta-Wiki discussion page or on this dataset’s Hugging Face discussion page.
The contents of this dataset are collectively written and curated by a global volunteer community. Please see the Attribution Information section for licensing and attribution guidance.
Supported Tasks and Leaderboards
The dataset in its structured form is generally helpful for a wide variety of tasks, including model development, pre-training, alignment, fine-tuning, retrieval-augmented generation, updating, testing, and benchmarking.
We would love to hear more about your use cases.
Languages
English
- BCP 47 Language Code: en
- Wikipedia edition: English Wikipedia
- Manual of Style: English Wikipedia Manual of Style
French / Français
- BCP 47 Language Code: fr
- Wikipedia edition: French Wikipedia
- Manual of Style: French Wikipedia conventions
There is one language edition for English and one for French. They encompass national and cultural variations of spelling, vocabulary, grammar, and usage. Within a Wikipedia language edition, no national variety is officially preferred over others. The rule of thumb is that the conventions of a particular language variety should be followed consistently within a given article.
As part of this beta Wikimedia Enterprise Structured Contents initiative, the team is expanding coverage of language editions. To request a specific language edition, please contact us on the Meta-Wiki talk page and explain your use case.
Dataset Structure
Data Instances
Each row represents one Wikipedia article snapshot stored in Parquet format.
Most scalar and moderately nested metadata fields are stored as native Arrow/Parquet types. Recursive or schema-unstable fields are stored as JSON-encoded strings.
Example abbreviated Parquet row, shown as JSON for readability:
{
  "name": "Josephine Baker",
  "identifier": 255083,
  "url": "https://en.wikipedia.org/wiki/Josephine_Baker",
  "description": "American-born French dancer...",
  "abstract": "Freda Josephine Baker...",
  "main_entity": {
    "identifier": "Q151972",
    "url": "https://www.wikidata.org/entity/Q151972"
  },
  "version": {
    "identifier": 123456789,
    "editor": {
      "identifier": 123,
      "name": "Example editor"
    },
    "scores": {}
  },
  "image": {
    "content_url": "https://upload.wikimedia.org/...",
    "width": 250,
    "height": 350
  },
  "infoboxes": "[{...}]",
  "sections": "[{...}]",
  "tables": "[{...}]",
  "references": [
    {
      "identifier": "...",
      "metadata": "{...}"
    }
  ]
}
Timestamp
Dataset extracted: 13 May 2026
Included Releases
| Language | Snapshot identifier | Rows | Shards | Size |
|---|---|---|---|---|
| English | enwiki_namespace_0 | 7,597,149 | 86 | 34.61 GiB |
| French | frwiki_namespace_0 | 2,871,732 | 26 | 9.81 GiB |
Total rows: 10,468,881
Total Parquet size: 44.42 GiB
Data Fields
The data fields are the same across the dataset. Noteworthy fields include:
- name - title of the article.
- identifier - ID of the article.
- url - URL of the article.
- version - metadata related to the latest specific revision of the article.
- version.editor - editor-specific signals that can help contextualize the revision.
- version.scores - returns assessments by ML models on the likelihood of a revision being reverted.
- main_entity - Wikidata QID the article is related to.
- abstract - lead section summarizing what the article is about.
- description - one-sentence description of the article for quick reference.
- images - the image object is found as a top-level object to refer to the main image for the article. The contents of an image object can also exist within an images array in other sections of the article payload, to refer to images used within the article body.
- infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
- sections - parsed sections of the article, including links, lists, citations, etc.
- tables - tables are defined at the top level of the article object in a tables array, each with a unique identifier. Sections or other parts of the article can reference these tables using their identifiers.
- references - parsed references including identifiers that link them to citations in sections.
The full list of fields is available in the Wikimedia Enterprise data dictionary.
Processing
- Downloaded the snapshot tarball from the Wikimedia Enterprise API.
- Streamed each .ndjson shard through a normalization pass:
  - JSON-encoded fields: sections, infoboxes, tables, and references[].metadata are stored as JSON-encoded strings. These fields either have recursive nesting that exceeds Apache Arrow’s C Data Interface limit or are open-dictionary structures whose keys vary across articles. Decode with json.loads on read.
  - Struct field ordering was canonicalized alphabetically and recursively so schemas match byte-for-byte across shards.
- Wrote one Parquet file per shard with zstd compression.
- Unified per-shard schemas with pa.unify_schemas; pinned the result to schema.json; re-cast every shard so embedded schemas are identical.
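The normalization pass described above can be approximated at the record level with the standard library alone. This is a minimal sketch under stated assumptions, not the pipeline's actual code: the helper names sort_keys_recursively and normalize_record are hypothetical, the sample record and its inner keys are invented for illustration, and each .ndjson line is assumed to be already parsed into a Python dict.

```python
import json

# Top-level fields the card says are stored as JSON-encoded strings.
JSON_FIELDS = ("sections", "infoboxes", "tables")


def sort_keys_recursively(value):
    """Return a copy of value with dict keys sorted alphabetically at every depth."""
    if isinstance(value, dict):
        return {key: sort_keys_recursively(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [sort_keys_recursively(item) for item in value]
    return value


def normalize_record(record):
    """Hypothetical per-record pass: sort keys, then JSON-encode the volatile fields."""
    out = sort_keys_recursively(record)  # fresh copy, the input is left untouched
    for field in JSON_FIELDS:
        if out.get(field) is not None:
            out[field] = json.dumps(out[field], ensure_ascii=False)
    # references[].metadata is encoded per reference; the list itself stays native.
    for ref in out.get("references") or []:
        if ref.get("metadata") is not None:
            ref["metadata"] = json.dumps(ref["metadata"], ensure_ascii=False)
    return out


# Invented sample record; real payloads carry many more fields.
record = {
    "url": "https://en.wikipedia.org/wiki/Example",
    "name": "Example",
    "sections": [{"type": "section", "name": "Abstract"}],
    "references": [{"metadata": {"title": "A source"}, "identifier": "ref-1"}],
}
normalized = normalize_record(record)
print(list(normalized))
print(normalized["sections"])
```

Sorting keys before any schema is inferred gives every shard the same field order, which is one plausible way to arrive at the byte-for-byte matching schemas mentioned above. The actual canonicalization may operate on Arrow struct types rather than on Python dicts.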
Loading
from datasets import load_dataset
import json

# Stream each language edition instead of downloading every shard up front.
en = load_dataset(
    "wikimedia/structured-wikipedia",
    "enwiki_namespace_0",
    split="train",
    streaming=True,
)
fr = load_dataset(
    "wikimedia/structured-wikipedia",
    "frwiki_namespace_0",
    split="train",
    streaming=True,
)

ds = en
row = next(iter(ds))
print(row["name"], row["url"])

# These columns hold JSON-encoded strings; decode them on read.
sections = json.loads(row["sections"]) if row["sections"] else None
infoboxes = json.loads(row["infoboxes"]) if row["infoboxes"] else None
tables = json.loads(row["tables"]) if row["tables"] else None

# references is a native list; only each reference's metadata is JSON-encoded.
for ref in row["references"] or []:
    metadata = json.loads(ref["metadata"]) if ref.get("metadata") else None
Polars
import polars as pl

df = pl.scan_parquet(
    "hf://datasets/wikimedia/structured-wikipedia/enwiki/data/*.parquet"
)
print(
    df.select(["name", "url"])
    .head()
    .collect()
)
DuckDB
import duckdb

duckdb.sql(
    '''
    SELECT name, url
    FROM 'hf://datasets/wikimedia/structured-wikipedia/enwiki/data/*.parquet'
    LIMIT 10
    '''
).show()
PyArrow
import pyarrow.dataset as ds

dataset = ds.dataset(
    "hf://datasets/wikimedia/structured-wikipedia/enwiki/data/",
    format="parquet",
)
table = dataset.to_table(
    columns=["name", "url"]
)
Reading the Pinned Schema
import json
import pyarrow as pa

# schema.json stores the pinned Arrow schema as a hex-encoded IPC message.
with open("schema.json") as f:
    payload = json.load(f)

schema = pa.ipc.read_schema(
    pa.py_buffer(bytes.fromhex(payload["arrow_ipc_hex"]))
)
Data Splits
This dataset is provided as a single train split for compatibility with dataset-loading tooling.
Dataset Creation
Curation Rationale
This dataset was created as part of the larger Structured Contents initiative at Wikimedia Enterprise, with the aim of making Wikimedia data more machine-readable.
Although Wikipedia is highly structured to the human eye, it is non-trivial to extract the knowledge within it in a machine-readable manner. Projects, languages, and domains each have their own community experts and ways of structuring data, supported by templates and best practices. A specific example addressed in this release is article tables. Wikipedia’s editorial communities work to keep tables populated with factual and structured information, and this dataset aims to make that information accessible at scale without bespoke parsing systems.
The dataset also includes links to Wikidata Q Identifiers and links to main and infobox images to facilitate easier access to additional information on specific topics.
Credibility signal fields are also included. These can help users decide when, how, and why to use what is in the dataset. These fields reflect editorial policies created and maintained by Wikipedia editing communities over more than 20 years. Many of these signals are found under the version object, while other objects such as protection and watchers_count offer related insight.
This is an early beta release of pre-parsed Wikipedia articles in bulk. It is shared to improve transparency in the development process, gather insight into current use cases, and collect feedback from the AI community. There will be limitations, described in the Other Known Limitations section, but in line with Wikimedia values, we believe it is better to share early, share often, and respond to feedback.
You can also test more languages on an article-by-article basis through the beta Structured Contents On-Demand endpoint with a free account.
Attribution is core to the sustainability of Wikimedia projects. It drives new editors and donors to Wikipedia. With consistent attribution, the cycle of content creation and reuse helps ensure that encyclopedic content of high quality, reliability, and verifiability continues to be written on Wikipedia and remains available for reuse through datasets such as this one.
Users of this dataset are expected to conform to Wikimedia attribution expectations. Detailed attribution guidance is provided in the Attribution Information section.
Beyond attribution, there are many ways to contribute to and support the Wikimedia movement. To discuss specific circumstances, please contact Nicholas Perry from the Wikimedia Foundation technical partnerships team at nperry@wikimedia.org. You can also contact us on the Wikimedia Enterprise Meta-Wiki discussion page or on this dataset’s Hugging Face discussion page.
Source Data
Initial Data Collection and Normalization
The dataset is built from the Wikimedia Enterprise Structured Contents API and focuses on the English and French Wikipedia article namespace, namespace 0, also known as mainspace.
Who are the source language producers?
Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001.
It is one of the largest and most accessed educational resources in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics.
This dataset includes the complete article contents of two Wikipedia language editions, English and French, written by their respective communities.
Annotations
Annotation process
N/A
The dataset does not contain additional annotations.
Who are the annotators?
N/A
Personal and Sensitive Information
The Wikipedia community and the Wikimedia Foundation, which operates Wikipedia, maintain robust policies and guidelines regarding personal and sensitive information. These policies are intended both to avoid inappropriate personal or sensitive information within articles and to protect the privacy of Wikimedia contributors.
The Wikimedia Foundation Privacy Policy is available here: Privacy Policy.
Transparency reports covering the Wikimedia Foundation’s responses to requests to alter or remove content from the projects, and requests to provide nonpublic information about users, are available here: Transparency Reports.
Among the many editorial policies regarding personal and sensitive information, the community pays particular attention to biographies of living people. Details of each language community’s responses and norms can be found here: Biographies of Living People policies.
Considerations for Using the Data
Social Impact of Dataset
Wikipedia’s articles are read over 20 billion times by half a billion people each month. Wikipedia does not belong to, or come from, a single culture or language. It is an example of mass international cooperation across languages and continents. Wikipedia is the only website among the world’s most visited that is operated by a nonprofit organization.
Wikimedia projects have been used, and upsampled, as a core source of qualitative data in AI, ML, and LLM development. The Wikimedia Foundation has published an article on the value of Wikipedia in the age of generative AI: Wikipedia in the age of generative AI.
There is also a community article about why Wikimedia data matters for ML on the Hugging Face blog. It highlights that Wikimedia data is rich and diverse, multimodal, community-curated, and openly licensed: Wikimedia data matters for ML.
Discussion of Biases
While consciously trying to present an editorially neutral point of view, Wikipedia’s content reflects the biases of the society it comes from. This includes various gaps, notably in both the proportion of biographies of women and the proportion of editors who identify as women. Other significant gaps include linguistic accessibility, technical accessibility, and censorship.
Because the content is written by its readers, ensuring the widest possible access to the content is crucial to reducing the biases of the content itself. There is continuous work to redress these biases through various social and technical efforts, both centrally and at the grassroots level around the world.
Other Known Limitations
This is a beta version. The following limitations may apply:
- A small percentage of duplicated, deleted, or missed articles may be part of the snapshot. Duplicates can be filtered out by looking at the highest version.identifier, which represents the most up-to-date revision of the article.
- Revision discrepancies may occur due to limitations with long articles.
- JSON-encoded volatile fields: to maintain a stable and unified schema across all shards, highly polymorphic fields such as infoboxes, tables, and references.metadata are stored as JSON-serialized strings. This prevents schema collisions caused by the unpredictable structure of Wikipedia templates. Users should apply json.loads() to these columns during preprocessing.
- Nesting depth caps: Wikipedia’s hierarchical section structure can exceed the recursion limits of many columnar data tools. To ensure the files are loadable by standard libraries, deeply nested structures are flattened or JSON-encoded.
Please let us know if there are other limitations that are not covered above.
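The duplicate filter mentioned in the first limitation can be sketched as follows. This is an illustrative helper, not an official utility: the name latest_revisions is hypothetical, and it assumes that duplicated rows share the same top-level identifier, which the card does not state explicitly.

```python
def latest_revisions(rows):
    """Keep, for each article identifier, the row with the highest version.identifier."""
    latest = {}
    for row in rows:
        key = row["identifier"]  # assumption: duplicates share the article identifier
        revision = row["version"]["identifier"]
        kept = latest.get(key)
        if kept is None or revision > kept["version"]["identifier"]:
            latest[key] = row
    return list(latest.values())


# Invented sample rows, reduced to the fields the filter needs.
rows = [
    {"identifier": 1, "name": "A", "version": {"identifier": 10}},
    {"identifier": 2, "name": "B", "version": {"identifier": 7}},
    {"identifier": 1, "name": "A", "version": {"identifier": 12}},
]
unique = latest_revisions(rows)
print([(row["identifier"], row["version"]["identifier"]) for row in unique])
```

The helper keeps one full row per article in memory, which is fine for a sample or a single shard. For the complete dataset, it is probably more practical to first collect only the highest revision per identifier and then filter rows in a second pass.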
Additional Information
Dataset Curators
This dataset was created by the Wikimedia Enterprise team of the Wikimedia Foundation as part of the Structured Contents initiative.
Attribution Information
Attribution is essential to honor the open licenses governing Wikimedia’s community-driven content. It is also essential for fair acknowledgment and active awareness of Wikimedia’s community-driven content, and it is a key factor in the continued growth and sustainability of the free knowledge ecosystem.
Reusers of Wikimedia content can provide relevant, up-to-date, and carefully curated content for their audiences while also helping to keep the circle of free, human-produced knowledge alive by protecting trust, ensuring transparency, and fostering participation.
The Wikimedia Attribution Framework provides guidelines that data reusers can follow to ensure that sources remain clear, recognizable, and consistent in external contexts. We recommend visiting the Reuse Scenarios section to learn how to attribute Wikimedia content according to your use case.
Below is a simple overview of the main attribution signals that can be used when attributing Wikimedia content in line with its license and where those signals can be found in this dataset.
Essential Signals
- Source: State the project, for example “English Wikipedia,” using the is_part_of.url or is_part_of.identifier field.
- Title: The title of a page can be derived from its URL or found in the name field.
- Link: The link to the Wikipedia page itself can be found in the url field. You can use the name field and hyperlink it to the URL.
- Credit: This dataset does not include individual author metadata for images. To satisfy licensing, ensure that the Wikipedia URL is immediately accessible so users can find the original author and license on Wikimedia Commons, or ideally add this metadata yourself.
- License: The license of the data being surfaced can be found in the license field. Be aware that data from linked URLs in this dataset may have a different license, for example media files from Wikimedia Commons that are often linked from Wikipedia articles.
- Modification: Clearly state if the content has been modified, summarized, transformed, or aggregated.
- Brand mark: Use official Wikimedia visual identity where appropriate for quick recognition. See also the Wikimedia Foundation Visual Identity Guidelines.
Example:
Source: NASA on English Wikipedia, CC BY-SA 4.0
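An attribution line in this shape could be assembled from the essential signals roughly as follows. This is a sketch under assumptions: the function names and the PROJECT_NAMES mapping are hypothetical, the identifier values "enwiki" and "frwiki" are assumed rather than confirmed by this card, and the license label is passed in by the caller because the internal shape of the license field is not described here.

```python
import html

# Hypothetical mapping from is_part_of.identifier to a display name.
PROJECT_NAMES = {"enwiki": "English Wikipedia", "frwiki": "French Wikipedia"}


def attribution_line(title, project_identifier, license_name, modified=False):
    """Plain-text attribution: source project, title, license, and a modification note."""
    project = PROJECT_NAMES.get(project_identifier, project_identifier)
    line = f"Source: {title} on {project}, {license_name}"
    if modified:
        line += " (modified)"
    return line


def attribution_html(title, url, project_identifier, license_name, modified=False):
    """Same attribution with the title hyperlinked to the article URL."""
    project = PROJECT_NAMES.get(project_identifier, project_identifier)
    link = f'<a href="{html.escape(url, quote=True)}">{html.escape(title)}</a>'
    line = f"Source: {link} on {html.escape(project)}, {html.escape(license_name)}"
    if modified:
        line += " (modified)"
    return line


print(attribution_line("NASA", "enwiki", "CC BY-SA 4.0"))
print(attribution_html("NASA", "https://en.wikipedia.org/wiki/NASA", "enwiki", "CC BY-SA 4.0"))
```

In practice the title and link would come from the name and url fields of a row, and the modified flag should be set whenever the content has been summarized, transformed, or aggregated.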
Beyond the Basics: Trust & Relevance and Ecosystem Growth signals
We encourage reusers to go beyond legal minimums by surfacing Trust & Relevance signals, such as contributor counts, reference counts, and last-updated timestamps. These signals tell your users that the information is backed by a living, collaborative community of human editors.
Ecosystem growth signals are designed to sustain the cycle of free knowledge by inviting users to participate or donate to Wikimedia projects.
Tools & Resources
- Wikimedia Attribution Framework: Detailed guidelines for various reuse scenarios (Search, AI, Social Media).
- Attribution API: A developer-friendly tool to programmatically fetch rich attribution signals not included in this static snapshot.
- Read more: A Better Way to Give Credit: Introducing the Wikimedia Attribution Framework and API
- Support: To request a brand attribution walkthrough or a customized solution, contact brandattribution@wikimedia.org.
Citation Information
If you are using this dataset for research, model training, or benchmarking, please cite this specific distribution so others can reproduce your work.
General citation:
Wikimedia Enterprise Structured Contents Dataset (2026), English and French editions. Distributed by Wikimedia Enterprise, via Hugging Face.
@ONLINE{structured-wikipedia,
author = {Wikimedia Enterprise, Wikimedia Foundation},
title = {Structured Contents Wikipedia},
month = {may},
year = {2026}
}