Buckets:

clem
/

structured-wikipedia-bucket

72.6 GB

124 files

Updated 11 days ago


Name	Size	Uploaded	Xet hash
20240916.en		11 days ago	1 items
20240916.fr		11 days ago	1 items
enwiki		11 days ago	87 items
frwiki		11 days ago	27 items
images		11 days ago	6 items
.gitattributes	2.42 kB xet	11 days ago	7a44c6d8
README.md	26.3 kB xet	11 days ago	7fca8a3d

README.md

Dataset Card for Wikimedia Structured Wikipedia

Dataset Description

Homepage: https://enterprise.wikimedia.com/
Point of Contact: Stephanie Delbecque
Total size: 44.42 GiB


Dataset Summary

Pre-parsed English and French Wikipedia articles, extracted using the Wikimedia Enterprise Snapshot API.

This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and output as structured data with a consistent schema. The dataset is provided in Parquet format, optimized for high-performance analytical queries and efficient storage. This version uses a unified, pinned schema across all files, making it compatible with DuckDB, pandas, Polars, and Apache Spark out of the box.

New in this dataset:

Parsed references and citations, connecting Wikipedia's knowledge with its sources of truth.
Parsed tables, one of the most information-heavy sections of Wikipedia pages.
Credibility signals, for example referenceneed and referencerisk, signaling where information may not be sufficiently backed by sources.
Improvements to list parsing, including nested lists, ordered lists, and definition lists.
Article-body images are parsed and available in the article body section payload.

Invitation for Feedback

The dataset is built as part of the Structured Contents initiative and based on Wikimedia Enterprise HTML snapshots. This updated version includes pre-parsed Wikipedia abstracts, short descriptions, main image, infoboxes, article sections, tables, references, citations, lists, article images, and more.

For updates, follow the Wikimedia Enterprise blog and MediaWiki Quarterly software updates.

We welcome feedback to help refine and improve this dataset. Please share suggestions or issues on the Wikimedia Enterprise Meta-Wiki discussion page or on this dataset’s Hugging Face discussion page.

The contents of this dataset are collectively written and curated by a global volunteer community. Please see the Attribution Information section for licensing and attribution guidance.

Supported Tasks and Leaderboards

The dataset in its structured form is generally helpful for a wide variety of tasks, including model development, pre-training, alignment, fine-tuning, retrieval-augmented generation, updating, testing, and benchmarking.

We would love to hear more about your use cases.

Languages

English

BCP 47 Language Code: en
Wikipedia edition: English Wikipedia
Manual of Style: English Wikipedia Manual of Style

French / Français

BCP 47 Language Code: fr
Wikipedia edition: French Wikipedia
Manual of Style: French Wikipedia conventions

There is one language edition for English and one for French. They encompass national and cultural variations of spelling, vocabulary, grammar, and usage. Within a Wikipedia language edition, no national variety is officially preferred over others. The rule of thumb is that the conventions of a particular language variety should be followed consistently within a given article.

As part of this beta Wikimedia Enterprise Structured Contents initiative, the team is expanding coverage of language editions. To request a specific language edition, please contact us on the Meta-Wiki talk page and explain your use case.

Dataset Structure

Data Instances

Each row represents one Wikipedia article snapshot stored in Parquet format.

Most scalar and moderately nested metadata fields are stored as native Arrow/Parquet types. Recursive or schema-unstable fields are stored as JSON-encoded strings.

Example abbreviated Parquet row, shown as JSON for readability:

{
  "name": "Josephine Baker",
  "identifier": 255083,
  "url": "https://en.wikipedia.org/wiki/Josephine_Baker",
  "description": "American-born French dancer...",
  "abstract": "Freda Josephine Baker...",
  "main_entity": {
    "identifier": "Q151972",
    "url": "https://www.wikidata.org/entity/Q151972"
  },
  "version": {
    "identifier": 123456789,
    "editor": {
      "identifier": 123,
      "name": "Example editor"
    },
    "scores": {}
  },
  "image": {
    "content_url": "https://upload.wikimedia.org/...",
    "width": 250,
    "height": 350
  },
  "infoboxes": "[{...}]",
  "sections": "[{...}]",
  "tables": "[{...}]",
  "references": [
    {
      "identifier": "...",
      "metadata": "{...}"
    }
  ]
}

Timestamp

Dataset extracted: 13 May 2026

Included Releases

Language	Snapshot identifier	Rows	Shards	Size
English	`enwiki_namespace_0`	7,597,149	86	34.61 GiB
French	`frwiki_namespace_0`	2,871,732	26	9.81 GiB

Total rows: 10,468,881

Total Parquet size: 44.42 GiB

Data Fields

The data fields are the same across the dataset. Noteworthy fields include:

name - title of the article.
identifier - ID of the article.
url - URL of the article.
version - metadata about the latest revision of the article.
version.editor - editor-specific signals that can help contextualize the revision.
version.scores - assessments by ML models of the likelihood that a revision will be reverted.
main_entity - Wikidata QID the article is related to.
abstract - lead section summarizing what the article is about.
description - one-sentence description of the article for quick reference.
image - the top-level image object refers to the main image for the article. Image objects with the same structure can also appear in images arrays in other sections of the article payload, where they refer to images used within the article body.
infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
sections - parsed sections of the article, including links, lists, citations, etc.
tables - tables are defined at the top level of the article object in a tables array, each with a unique identifier. Sections or other parts of the article can reference these tables using their identifiers.
references - parsed references including identifiers that link them to citations in sections.

The full list of fields is available in the Wikimedia Enterprise data dictionary.

Processing

Downloaded the snapshot tarball from the Wikimedia Enterprise API.
Streamed each .ndjson shard through a normalization pass:
- JSON-encoded fields: sections, infoboxes, tables, and references[].metadata are stored as JSON-encoded strings. These fields either have recursive nesting that exceeds Apache Arrow’s C Data Interface limit or are open-dictionary structures whose keys vary across articles. Decode with json.loads on read.
- Struct field ordering was canonicalized alphabetically and recursively so schemas match byte-for-byte across shards.
Wrote one Parquet file per shard with zstd compression.
Unified per-shard schemas with pa.unify_schemas; pinned the result to schema.json; re-cast every shard so embedded schemas are identical.
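
As a rough illustration of the normalization pass described above, the following is a minimal sketch that works on plain Python dicts rather than Arrow schemas. It is not the pipeline's actual code: the function names `canonicalize` and `normalize_record`, and the sample record's nested keys, are hypothetical.

```python
import json

# Top-level fields the card says are stored as JSON-encoded strings.
JSON_FIELDS = ("sections", "infoboxes", "tables")


def canonicalize(value):
    """Recursively sort dict keys so every record has the same field order."""
    if isinstance(value, dict):
        return {key: canonicalize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [canonicalize(item) for item in value]
    return value


def normalize_record(record):
    """Hypothetical normalization of one parsed .ndjson line."""
    out = dict(record)
    # JSON-encode the recursive or open-dictionary top-level fields.
    for field in JSON_FIELDS:
        if out.get(field) is not None:
            out[field] = json.dumps(out[field], ensure_ascii=False)
    # references[].metadata is encoded per reference.
    refs = out.get("references")
    if refs is not None:
        out["references"] = [
            {**ref, "metadata": json.dumps(ref["metadata"], ensure_ascii=False)}
            if ref.get("metadata") is not None
            else dict(ref)
            for ref in refs
        ]
    # Sort struct fields alphabetically at every nesting level.
    return canonicalize(out)


# Made-up record; the nested keys are illustrative only.
record = {
    "name": "Example",
    "version": {"identifier": 1, "editor": {"name": "E", "identifier": 2}},
    "sections": [{"type": "section", "has_parts": []}],
    "references": [{"identifier": "r1", "metadata": {"title": "T"}}],
}
normalized = normalize_record(record)
```

After this pass every record exposes its keys in the same alphabetical order, which is the property that would let per-shard schemas line up before they are unified and pinned.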

Loading

from datasets import load_dataset
import json

en = load_dataset(
    "wikimedia/structured-wikipedia",
    "enwiki_namespace_0",
    split="train",
    streaming=True,
)

fr = load_dataset(
    "wikimedia/structured-wikipedia",
    "frwiki_namespace_0",
    split="train",
    streaming=True,
)

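ds = en  # swap in fr for the French edition

# Streaming yields rows lazily; take the first one.
row = next(iter(ds))

print(row["name"], row["url"])

# sections, infoboxes, and tables are JSON-encoded strings: decode on read.
sections = json.loads(row["sections"]) if row["sections"] else None
infoboxes = json.loads(row["infoboxes"]) if row["infoboxes"] else None
tables = json.loads(row["tables"]) if row["tables"] else None
# Each reference carries its own JSON-encoded metadata string.
for ref in row["references"] or []: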
    metadata = json.loads(ref["metadata"]) if ref.get("metadata") else None
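
The decoding steps above can be wrapped in a small helper so that every row is handled the same way. This is only a sketch: `decode_row` is a hypothetical name, and the sample row uses made-up values standing in for a row yielded by `load_dataset`.

```python
import json

# Columns the card says are stored as JSON-encoded strings.
JSON_COLUMNS = ("sections", "infoboxes", "tables")


def decode_row(row):
    """Return a copy of `row` with the JSON-encoded columns parsed."""
    decoded = dict(row)
    for column in JSON_COLUMNS:
        raw = decoded.get(column)
        # Missing, None, and empty-string values all decode to None.
        decoded[column] = json.loads(raw) if raw else None
    decoded["references"] = [
        {**ref, "metadata": json.loads(ref["metadata"]) if ref.get("metadata") else None}
        for ref in (row.get("references") or [])
    ]
    return decoded


# Stand-in row with made-up values.
sample = {
    "name": "Josephine Baker",
    "sections": '[{"name": "Abstract"}]',
    "infoboxes": None,
    "tables": "",
    "references": [{"identifier": "ref-1", "metadata": '{"title": "Example"}'}],
}
decoded = decode_row(sample)
```

The helper leaves the input row untouched, so it can be applied lazily, for example with `map`, over a streaming dataset.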

Polars

import polars as pl

df = pl.scan_parquet(
    "hf://datasets/wikimedia/structured-wikipedia/enwiki/data/*.parquet"
)

print(
    df.select(["name", "url"])
      .head()
      .collect()
)

DuckDB

import duckdb

duckdb.sql(
    '''
    SELECT name, url
    FROM 'hf://datasets/wikimedia/structured-wikipedia/enwiki/data/*.parquet'
    LIMIT 10
    '''
).show()

PyArrow

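import pyarrow.dataset as ds
from huggingface_hub import HfFileSystem

# Pass the Hugging Face filesystem explicitly: PyArrow releases without
# built-in hf:// support cannot resolve that URI scheme on their own.
dataset = ds.dataset(
    "datasets/wikimedia/structured-wikipedia/enwiki/data/",
    format="parquet",
    filesystem=HfFileSystem(),
)

# Read only the projected columns.
table = dataset.to_table(
    columns=["name", "url"]
)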

Reading the Pinned Schema

import json
import pyarrow as pa

# schema.json holds the pinned schema as a hex-encoded Arrow IPC message.
with open("schema.json") as f:
    payload = json.load(f)

schema = pa.ipc.read_schema(
    pa.py_buffer(bytes.fromhex(payload["arrow_ipc_hex"]))
)

Data Splits

This dataset is provided as a single train split for compatibility with dataset-loading tooling.

Dataset Creation

Curation Rationale

This dataset was created as part of the larger Structured Contents initiative at Wikimedia Enterprise, with the aim of making Wikimedia data more machine-readable.

Although Wikipedia is highly structured to the human eye, it is non-trivial to extract the knowledge within it in a machine-readable manner. Projects, languages, and domains each have their own community experts and ways of structuring data, supported by templates and best practices. A specific example addressed in this release is article tables. Wikipedia’s editorial communities work to keep tables populated with factual and structured information, and this dataset aims to make that information accessible at scale without bespoke parsing systems.

The dataset also includes links to Wikidata Q Identifiers and links to main and infobox images to facilitate easier access to additional information on specific topics.

Credibility signal fields are also included. These can help users decide when, how, and why to use what is in the dataset. These fields reflect editorial policies created and maintained by Wikipedia editing communities over more than 20 years. Many of these signals are found under the version object, while other objects such as protection and watchers_count offer related insight.

This is an early beta release of pre-parsed Wikipedia articles in bulk. It is shared to improve transparency in the development process, gather insight into current use cases, and collect feedback from the AI community. There will be limitations, described in the Other Known Limitations section, but in line with Wikimedia values, we believe it is better to share early, share often, and respond to feedback.

You can also test more languages on an article-by-article basis through the beta Structured Contents On-Demand endpoint with a free account.

Attribution is core to the sustainability of Wikimedia projects. It drives new editors and donors to Wikipedia. With consistent attribution, the cycle of content creation and reuse helps ensure that encyclopedic content of high quality, reliability, and verifiability continues to be written on Wikipedia and remains available for reuse through datasets such as this one.

Users of this dataset are expected to conform to Wikimedia attribution expectations. Detailed attribution guidance is provided in the Attribution Information section.

Beyond attribution, there are many ways to contribute to and support the Wikimedia movement. To discuss specific circumstances, please contact Nicholas Perry from the Wikimedia Foundation technical partnerships team at nperry@wikimedia.org. You can also contact us on the Wikimedia Enterprise Meta-Wiki discussion page or on this dataset’s Hugging Face discussion page.

Source Data

Initial Data Collection and Normalization

The dataset is built from the Wikimedia Enterprise Structured Contents API and focuses on the English and French Wikipedia article namespace, namespace 0, also known as mainspace.

Who are the source language producers?

Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001.

It is one of the largest and most accessed educational resources in history, read over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics.

This dataset includes the complete article contents of two Wikipedia language editions, English and French, written by their respective communities.

Annotations

Annotation process

N/A

The dataset does not contain additional annotations.

Who are the annotators?

N/A

Personal and Sensitive Information

The Wikipedia community and the Wikimedia Foundation, which operates Wikipedia, maintain robust policies and guidelines regarding personal and sensitive information. These policies are intended both to avoid inappropriate personal or sensitive information within articles and to protect the privacy of Wikimedia contributors.

The Wikimedia Foundation Privacy Policy is available here: Privacy Policy.

Transparency reports covering the Wikimedia Foundation’s responses to requests to alter or remove content from the projects, and requests to provide nonpublic information about users, are available here: Transparency Reports.

Among many editorial policies regarding personal and sensitive information, particular care is paid by the community to biographies of living people. Details for each language community’s responses and norms can be found here: Biographies of Living People policies.

Considerations for Using the Data

Social Impact of Dataset

Wikipedia’s articles are read over 20 billion times by half a billion people each month. Wikipedia does not belong to, or come from, a single culture or language. It is an example of mass international cooperation across languages and continents. Wikipedia is the only website among the world’s most visited that is operated by a nonprofit organization.

Wikimedia projects have been used, and upsampled, as a core source of qualitative data in AI, ML, and LLM development. The Wikimedia Foundation has published an article on the value of Wikipedia in the age of generative AI: Wikipedia in the age of generative AI.

There is also a community article about why Wikimedia data matters for ML on the Hugging Face blog. It highlights that Wikimedia data is rich and diverse, multimodal, community-curated, and openly licensed: Wikimedia data matters for ML.

Discussion of Biases

While consciously trying to present an editorially neutral point of view, Wikipedia’s content reflects the biases of the society it comes from. This includes various gaps, notably in both the proportion of biographies of women and the proportion of editors who identify as women. Other significant gaps include linguistic accessibility, technical accessibility, and censorship.

Because the content is written by its readers, ensuring the widest possible access to the content is crucial to reducing the biases of the content itself. There is continuous work to redress these biases through various social and technical efforts, both centrally and at the grassroots level around the world.

Other Known Limitations

This is a beta version. The following limitations may apply:

A small percentage of duplicated, deleted, or missed articles may be part of the snapshot. Duplicates can be filtered out by looking at the highest version.identifier, which represents the most up-to-date revision of the article.
Revision discrepancies may occur due to limitations with long articles.
JSON-encoded volatile fields: to maintain a stable and unified schema across all shards, highly polymorphic fields such as infoboxes, tables, and references.metadata are stored as JSON-serialized strings. This prevents schema collisions caused by the unpredictable structure of Wikipedia templates. Users should apply json.loads() to these columns during preprocessing.
Nesting depth caps: Wikipedia’s hierarchical section structure can exceed the recursion limits of many columnar data tools. To ensure the files are loadable by standard libraries, deeply nested structures are flattened or JSON-encoded.
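
The duplicate filter mentioned in the first limitation could look like the sketch below. It assumes rows are dicts shaped like the abbreviated example row, with an integer `identifier` and a nested `version.identifier`; the function name `deduplicate` and the sample values are hypothetical.

```python
def deduplicate(rows):
    """Keep, per article identifier, the row with the highest version.identifier."""
    latest = {}
    for row in rows:
        key = row["identifier"]
        revision = row["version"]["identifier"]
        kept = latest.get(key)
        if kept is None or revision > kept["version"]["identifier"]:
            latest[key] = row
    return list(latest.values())


# Made-up rows: the first two share an article identifier.
rows = [
    {"identifier": 255083, "name": "Josephine Baker", "version": {"identifier": 100}},
    {"identifier": 255083, "name": "Josephine Baker", "version": {"identifier": 250}},
    {"identifier": 42, "name": "Example", "version": {"identifier": 7}},
]
unique = deduplicate(rows)
```

This version keeps one row per article in memory, so for a full snapshot it may be more practical to express the same "highest revision per identifier" rule in a query engine such as DuckDB or Polars.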

Please let us know if there are other limitations that are not covered above.

Additional Information

Dataset Curators

This dataset was created by the Wikimedia Enterprise team of the Wikimedia Foundation as part of the Structured Contents initiative.

Attribution Information

Attribution is essential to honor the open licenses governing Wikimedia’s community-driven content. It is also essential for fair acknowledgment and active awareness of Wikimedia’s community-driven content, and it is a key factor in the continued growth and sustainability of the free knowledge ecosystem.

Reusers of Wikimedia content can provide relevant, up-to-date, and carefully curated content for their audiences while also helping to keep the circle of free, human-produced knowledge alive by protecting trust, ensuring transparency, and fostering participation.

The Wikimedia Attribution Framework provides guidelines that data reusers can follow to ensure that sources remain clear, recognizable, and consistent in external contexts. We recommend visiting the Reuse Scenarios section to learn how to attribute Wikimedia content according to your use case.

Below is a simple overview of the main attribution signals that can be used when attributing Wikimedia content in line with its license and where those signals can be found in this dataset.

Essential Signals

Source: State the project, for example “English Wikipedia,” using the is_part_of.url or is_part_of.identifier field.
Title: The title of a page can be derived from its URL or found in the name field.
Link: The link to the Wikipedia page itself can be found in the url field. You can use the name field and hyperlink it to the URL.
Credit: This dataset does not include individual author metadata for images. To satisfy licensing, ensure that the Wikipedia URL is immediately accessible so users can find the original author and license on Wikimedia Commons, or ideally add this metadata yourself.
License: The license of the data being surfaced can be found in the license field. Be aware that data from linked URLs in this dataset may have a different license, for example media files from Wikimedia Commons that are often linked from Wikipedia articles.
Modification: Clearly state if the content has been modified, summarized, transformed, or aggregated.
Brand mark: Use official Wikimedia visual identity where appropriate for quick recognition. See also the Wikimedia Foundation Visual Identity Guidelines.

Example: Wikipedia W Logo Source: NASA on English Wikipedia, CC BY-SA 4.0
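
To show how the essential signals might be assembled from a row, here is a minimal sketch. Several things in it are assumptions rather than documented facts: the `PROJECT_NAMES` mapping and its keys, the shape of the `license` field (taken here to be a list of objects with a `name`), and the function name `attribution_line`. Check the data dictionary before relying on any of them.

```python
# Hypothetical mapping from is_part_of.identifier to a display name.
PROJECT_NAMES = {"enwiki": "English Wikipedia", "frwiki": "French Wikipedia"}


def attribution_line(row, modified=False):
    """Build a plain-text credit line from the essential signals."""
    part_of = row.get("is_part_of") or {}
    project = PROJECT_NAMES.get(
        part_of.get("identifier"), part_of.get("url", "Wikipedia")
    )
    # Assumption: `license` is a list of objects that each carry a `name`.
    licenses = ", ".join(
        item["name"] for item in (row.get("license") or []) if item.get("name")
    )
    line = f'Source: "{row["name"]}" on {project} ({row["url"]})'
    if licenses:
        line += f", {licenses}"
    if modified:
        # Modification signal: flag content that was changed or summarized.
        line += ", modified"
    return line


# Made-up row for illustration.
row = {
    "name": "NASA",
    "url": "https://en.wikipedia.org/wiki/NASA",
    "is_part_of": {"identifier": "enwiki", "url": "https://en.wikipedia.org"},
    "license": [{"name": "CC BY-SA 4.0"}],
}
credit = attribution_line(row)
```

In a user interface, the title would normally be hyperlinked to the URL instead of printing the URL inline; the plain-text form is used here only to keep the sketch self-contained.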

Beyond the Basics: Trust & Relevance and Ecosystem Growth signals

We encourage reusers to go beyond legal minimums by surfacing Trust & Relevance signals, such as contributor counts, reference counts, and last-updated timestamps. These signals tell your users that the information is backed by a living, collaborative community of human editors.

Ecosystem growth signals are designed to sustain the cycle of free knowledge by inviting users to participate or donate to Wikimedia projects.

Tools & Resources

Wikimedia Attribution Framework: Detailed guidelines for various reuse scenarios (Search, AI, Social Media).
Attribution API: A developer-friendly tool to programmatically fetch rich attribution signals not included in this static snapshot.
Read more: A Better Way to Give Credit: Introducing the Wikimedia Attribution Framework and API
Support: To request a brand attribution walkthrough or a customized solution, contact brandattribution@wikimedia.org.

Citation Information

If you are using this dataset for research, model training, or benchmarking, please cite this specific distribution so others can reproduce your work.

General citation:

Wikimedia Enterprise Structured Contents Dataset (2026), English and French editions. Distributed by Wikimedia Enterprise, via Hugging Face.

@ONLINE{structured-wikipedia,
  author = {Wikimedia Enterprise, Wikimedia Foundation},
  title = {Structured Contents Wikipedia},
  month = {may},
  year = {2026}
}

Last updated: May 22
