Data sets for the Generalized Language Model toolkit

by Rene Pickhardt, Thomas Gottron, Martin Koerner, Paul Georg Wagner, Till Speicher and Steffen Staab.

Download links for the data sets

The data provided on this page is licensed under CC-BY-SA-3.0 (unless otherwise stated). For proper attribution, please link to this page or cite my original ACL publication.

Wikipedia data sets

Here you will find the preprocessed data sets from the English, French, German, and Italian Wikipedia dumps as of November 2013. The original, unprocessed dumps can be downloaded separately, since the Wikipedia data dumps are published under the Creative Commons Attribution-ShareAlike license.

JRC Acquis Data set

The following data sets are taken from the JRC-Acquis corpus. They are provided as public domain content, with one usage condition in particular: only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic.

Explanation of the format and sample data

The zip container contains the following 8 files:

Note that the test sequences are the same in testing-samples-1.txt ... testing-samples-5.txt: the sequences in testing-samples-1.txt are subsequences of the ones in testing-samples-2.txt, and so on. That is why the sequences are saved in the same order.
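This nesting property can be verified with a few lines of Python. The sketch below assumes one whitespace-separated sequence per line (the exact file format is an assumption on my part) and checks that each sequence in the shorter-length file is a subsequence of the sequence on the same line of the longer-length file:

```python
# Sketch: verify the nesting of the testing-samples files.
# Assumes one whitespace-separated token sequence per line.

def is_subsequence(short, long):
    """True if all tokens of `short` occur in `long`, in order."""
    it = iter(long)
    return all(token in it for token in short)

def check_nesting(shorter_file, longer_file):
    """True if every line of `shorter_file` is a subsequence of the
    corresponding line of `longer_file`."""
    with open(shorter_file) as f1, open(longer_file) as f2:
        return all(
            is_subsequence(line1.split(), line2.split())
            for line1, line2 in zip(f1, f2)
        )

# e.g. check_nesting("testing-samples-1.txt", "testing-samples-2.txt")
```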

Sparse data experiment

As described in our paper, we ran experiments on successively smaller samples of the English Wikipedia corpus. These samples are created from wiki-en.tar.bz2 and can be downloaded here. The above files contain the training text in the file training.txt as well as about 100k testing sequences of length 1 to 5 in the files testing-samples-1.txt ... testing-samples-5.txt.
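If you want to create such samples yourself, the idea can be sketched as follows. This is only a minimal illustration that draws a random fraction of the training lines; the published samples may have been created differently (e.g. by article rather than by line):

```python
import random

def sample_corpus(in_path, out_path, fraction, seed=42):
    """Write roughly `fraction` of the lines of `in_path` to `out_path`.

    Hypothetical helper for producing successively smaller training
    samples; the seed makes the sampling reproducible.
    """
    rng = random.Random(seed)
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if rng.random() < fraction:
                dst.write(line)

# e.g. sample_corpus("training.txt", "training-10pct.txt", 0.1)
```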

Unseen training sequences

There are a number of Python scripts that split the test data into seen and unseen sequences. I have to clean them up and will publish them here soon. Based on the software, you should be able to do this yourself.
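In the meantime, such a split can be approximated with a short script. The sketch below is not the scripts mentioned above; it simply marks a test sequence as "seen" if it occurs verbatim in the training text, whereas the actual scripts may use a different notion of "seen" (e.g. membership in the extracted n-gram counts):

```python
def split_seen_unseen(training_path, test_path):
    """Partition test sequences into those occurring verbatim in the
    training text ("seen") and those that do not ("unseen").

    Hypothetical sketch; assumes one test sequence per line.
    """
    with open(training_path) as f:
        training_text = f.read()
    seen, unseen = [], []
    with open(test_path) as f:
        for line in f:
            sequence = line.strip()
            (seen if sequence in training_text else unseen).append(sequence)
    return seen, unseen
```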