Wikipedia Datasets for the Hadoop Hack

Overview:

Two of the datasets pre-loaded onto our cluster are articles from the English-language version of Wikipedia.

One of these is the current article index, containing the complete current page text of every article on the site. This corpus omits "user" and "talk" pages. It is approximately 20 GB in size and contains all 2.5 million currently published articles. The dataset has been provided in three versions:

  • /shared/wiki-articles/full -- contains the complete set of articles
  • /shared/wiki-articles/small -- contains a 256 MB slice of the articles. Good for small-scale testing
  • /shared/wiki-articles/micro -- contains about 10 MB of articles randomly selected. Good for unit testing. (You are encouraged to download this locally)

The other dataset is the complete revision history for a subset of the Wikipedia pages. This dataset is much larger, weighing in at 1 TB. It contains all revision snapshots of many pages from Wikipedia (an intact complete history dump was unavailable). This dataset is available in four versions:

  • /shared/wiki-history/full -- A 1 TB dataset that has been compressed (see section at the bottom of this document)
  • /shared/wiki-history/full-uncompressed -- The 1 TB dataset in uncompressed form. Use this only if there is trouble with the compressed data
  • /shared/wiki-history/small -- A 256 MB slice of the articles, uncompressed
  • /shared/wiki-history/micro -- About 10 MB of articles, for unit testing. (You are encouraged to download this locally)

Data Format:

The data was provided to us as a single XML file containing all the articles. A single article is represented as a page entity, which looks approximately like this:

The "unofficial DTD" as taken from http://meta.wikimedia.org/wiki/Help:Export#Export_format looks like:

The current-only and complete-history datasets both adhere to the same format; the difference is that the current-only dataset contains only one <revision> entity per page, whereas the complete-history dataset contains (potentially) several revisions per page.

Because parsing variable-length multi-line XML entities in Hadoop is onerous, we have preformatted the data by "flattening" it so that all the newlines have been converted to spaces within a single <page> entity, resulting in lines which look something like:

As part of the "flattening" process, the XML document has also been split into a large number of smaller files corresponding roughly to the intended map task input size. You can now use the TextInputFormat to read the corpus; each individual article will be presented to you as a single record in the Mapper (in the "Text" value component).
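As a rough illustration of working with one flattened record, here is a minimal Python sketch (usable for local experiments on the micro dataset) that pulls the title and page text out of a single line. The sample line is hypothetical: its element names follow the MediaWiki export format, but the values are invented, and the regular expressions assume that <title> and <text> each appear once per line, as in the current-only corpus.

```python
import re

# Hypothetical flattened record; the values are invented for illustration.
SAMPLE_LINE = (
    '<page> <title>Apache Hadoop</title> <id>42</id> <revision> <id>1001</id> '
    '<timestamp>2008-01-15T12:00:00Z</timestamp> '
    '<contributor> <username>ExampleUser</username> <id>7</id> </contributor> '
    '<text xml:space="preserve">Hadoop is a [[software framework]].</text> '
    '</revision> </page>'
)

TITLE_RE = re.compile(r'<title>(.*?)</title>')
TEXT_RE = re.compile(r'<text[^>]*>(.*?)</text>')


def parse_article(line):
    """Return (title, text) for one flattened <page> line, or None if either is missing."""
    title = TITLE_RE.search(line)
    text = TEXT_RE.search(line)
    if title is None or text is None:
        return None
    return title.group(1), text.group(1)


if __name__ == '__main__':
    print(parse_article(SAMPLE_LINE))
```

Regular expressions are used instead of an XML parser because each line is a fragment and is not guaranteed to be a well-formed document on its own.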

The HTML inside the <text> ... </text> segment corresponding to the actual page layout is escaped, so that <foo> becomes &lt;foo&gt;.
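If you need the original markup back, the escaping can be reversed with the Python standard library. This is a small sketch, not part of the dataset tooling; the function name unescape_text is a hypothetical helper, and it assumes the text was escaped exactly once.

```python
import html


def unescape_text(escaped):
    """Undo one level of XML/HTML entity escaping in the <text> body."""
    return html.unescape(escaped)


if __name__ == '__main__':
    print(unescape_text('a &lt;b&gt;bold&lt;/b&gt; claim &amp; more'))
```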

Links within Wikipedia itself are represented by [[destination]]. If the name of the destination page is not intended to be the hyperlink text, then the link text is also contained in the link tag like: [[destination_page|display_text]].
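The two link forms described above can be extracted with a single regular expression. The following Python sketch is one possible approach under simplifying assumptions: it ignores nested links (such as links inside image captions), and extract_links is a hypothetical helper name.

```python
import re

# [[destination]] or [[destination|display text]]; nested brackets are not handled.
LINK_RE = re.compile(r'\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]')


def extract_links(text):
    """Return a list of (destination, display_text) pairs found in wiki text."""
    links = []
    for match in LINK_RE.finditer(text):
        destination = match.group(1).strip()
        display = match.group(2)
        # When no display text is given, the destination doubles as the link text.
        display = destination if display is None else display.strip()
        links.append((destination, display))
    return links


if __name__ == '__main__':
    sample = 'See [[Apache Hadoop]] and [[MapReduce|the MapReduce model]].'
    for destination, display in extract_links(sample):
        print(destination, '->', display)
```

Emitting (source page, destination) pairs from a mapper is a natural starting point for building a link graph.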

The complete-history corpus contains a very large number of revisions for some pages (anywhere from a few KB to 3 GB of revisions for individual pages). Putting all of this data on one line was infeasible, and not particularly helpful to you. Therefore, the data has been denormalized in the complete-history corpus. For each <page> ... </page>, all the data between the <page> tag (inclusive) and the first <revision> tag (exclusive) is known as the preamble. The page was split so that each line of the dataset contains one or more <revision> ... </revision> entities from the same <page>, each separated by an arbitrary amount of whitespace. The preamble for the page has been prepended to all lines containing revisions for a page. The </page> tag may be missing from some or all lines. (Thus, this is no longer a valid XML document.)

Thus, the data:

Will be flattened to something like:
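To make the layout concrete, here is a minimal Python sketch of the denormalization described above, together with a helper that splits a resulting line back into its preamble and revisions. Everything in it is an assumption for illustration: the sample page is invented, the number of revisions per line is arbitrary, and the actual preprocessing used on the cluster may differ in detail.

```python
import re

REVISION_RE = re.compile(r'<revision>.*?</revision>', re.DOTALL)

# Invented page with three revisions.
SAMPLE_PAGE = """<page>
<title>Example</title>
<id>1</id>
<revision><id>10</id><timestamp>2007-01-01T00:00:00Z</timestamp><text>first</text></revision>
<revision><id>11</id><timestamp>2007-02-01T00:00:00Z</timestamp><text>second</text></revision>
<revision><id>12</id><timestamp>2007-03-01T00:00:00Z</timestamp><text>third</text></revision>
</page>"""


def denormalize(page_xml, revisions_per_line=2):
    """Flatten one <page> into lines, each holding the preamble plus some revisions."""
    flat = page_xml.replace('\n', ' ')
    first = flat.find('<revision>')
    if first == -1:
        return [flat.strip()]
    preamble = flat[:first].strip()
    revisions = REVISION_RE.findall(flat)
    lines = []
    for i in range(0, len(revisions), revisions_per_line):
        chunk = ' '.join(revisions[i:i + revisions_per_line])
        lines.append(preamble + ' ' + chunk)  # the </page> tag is dropped here
    return lines


def split_history_line(line):
    """Return (preamble, [revision, ...]) for one denormalized line."""
    first = line.find('<revision>')
    if first == -1:
        return line.strip(), []
    return line[:first].strip(), REVISION_RE.findall(line[first:])


if __name__ == '__main__':
    for out_line in denormalize(SAMPLE_PAGE):
        print(out_line)
```

Because the page text is entity-escaped, a literal <revision> tag should not occur inside a <text> body, which is what makes the simple tag search workable.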

Compression on the wiki-history Dataset:

Reading 1 TB of data is a time-consuming process. To speed up your processing, we have compressed the wiki-history dataset in HDFS. The data in /shared/wiki-history/full has been processed by a filter which used an identity mapper to write the (LongWritable, Text) pairs returned by TextInputFormat back to HDFS as block-compressed SequenceFiles. You should read this directory with the SequenceFileInputFormat. The map input key class is still LongWritable (representing the byte offset of the record in a different arrangement of the data; effectively useless, but kept for type-compatibility with TextInputFormat); the map input value class is Text (the same Text as was on the line in the original text files).

The test datasets /shared/wiki-history/small and /micro are to be read with the TextInputFormat. The keys there are also LongWritable, representing the byte offsets of the lines given as values to the Mapper. These were left uncompressed to make manual inspection of their contents more straightforward.

All of the /shared/wiki-articles entries are uncompressed and should use TextInputFormat.

Further Information and Hack Ideas:

There are a number of directions one could explore with these datasets. Here are a few to get started with:

  • Wikipedia contains a rough "social network" of contributors; they are identified in <username>...</username> entities. What are the different editors doing? How do they interact with one another?
  • Articles in Wikipedia are organized into "Categories" by human cataloguers. Can articles be classified automatically? Can this process guess how the human editors would have done it?
  • Perform semantic analysis of the texts, either as they evolve over time, or within a static version of a page.
  • Do individual editors have a "style" associated with their grammar that can be extrapolated?
  • What do the edit timestamps tell us about the contents of different pages?
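As one possible first step toward the contributor questions above, the following Python sketch counts revisions per username in map/reduce style, run locally over a few invented history lines. The function names and sample data are hypothetical, and anonymous edits (which may be recorded without a <username> entity) are simply not counted.

```python
import re
from collections import Counter

USERNAME_RE = re.compile(r'<username>(.*?)</username>')


def map_usernames(line):
    """Map step: emit (username, 1) for every revision author on the line."""
    for name in USERNAME_RE.findall(line):
        yield name, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each username."""
    totals = Counter()
    for name, count in pairs:
        totals[name] += count
    return totals


if __name__ == '__main__':
    # Invented lines in the denormalized complete-history layout.
    sample_lines = [
        '<page> <title>A</title> <revision><contributor><username>Alice</username>'
        '</contributor></revision> <revision><contributor><username>Bob</username>'
        '</contributor></revision>',
        '<page> <title>B</title> <revision><contributor><username>Alice</username>'
        '</contributor></revision>',
    ]
    pairs = (pair for line in sample_lines for pair in map_usernames(line))
    print(reduce_counts(pairs).most_common())
```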