Data sets for the Generalized Language Model toolkit
by Rene Pickhardt, Thomas Gottron, Martin Koerner, Paul Georg Wagner, Till Speicher and Steffen Staab.
Download links of data sets
The data provided on this page is licensed under CC-BY-SA-3.0 (unless otherwise stated). For proper attribution, please link to this page or cite my original ACL publication.
Wikipedia data sets
Here you will find the preprocessed data sets from the English, French, German and Italian Wikipedia dumps as of November 2013. The original, unprocessed dumps can be downloaded from: http://dumps.wikimedia.org.
Since the Wikipedia data dumps are published under a Creative Commons Attribution-Share-Alike license, the data sets derived from them are distributed under the same terms.
JRC Acquis Data set
The following data sets are taken from http://ipsc.jrc.ec.europa.eu/?id=198#c2730. They are provided as public domain content under the usage conditions stated at http://optima.jrc.it/Acquis/JRC-Acquis.3.0/doc/licence.html. In particular: only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic.
Explanation of the format and sample data
The zip container contains the following 8 files:
Note that the test sequences are the same across testing-samples-1.txt ... testing-samples-5.txt: the sequences in testing-samples-1.txt are subsequences of the ones in testing-samples-2.txt, and so on. That is why the sequences are saved in the same order in every file.
- normalized.txt: one sentence per line, completely tokenized, with a space as the delimiter.
- training.txt: 80% of the lines of normalized.txt (randomly selected)
- testing.txt: the 20% complement of training.txt in normalized.txt. Sentences are already annotated with sentence beginning and end tags, i.e. <s> and </s>.
- testing-samples-1.txt ... testing-samples-5.txt: about 100k randomly sampled test sequences extracted from testing.txt. The integer in the file name indicates the length of the test sequence.
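The 80/20 split and the nested test sequences described above can be sketched in Python. This is an illustrative reconstruction, not the toolkit's actual sampling code: the function names, the fixed random seed, and the assumption that the length-n sequence consists of the last n tokens ending at a shared position (so every shorter sequence is contained in the longer ones) are mine.

```python
import random

def split_corpus(lines, train_fraction=0.8, seed=42):
    """Randomly split sentences into a training set (80%) and a testing set
    (20%), as done for training.txt and testing.txt. Seed is an assumption."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def nested_test_sequences(sentence, max_len=5, rng=random):
    """From one tokenized sentence, sample nested test sequences of length
    1..max_len. Assumption: the length-n sequence is the last n tokens
    ending at one randomly chosen position, so each shorter sequence is a
    subsequence of every longer one (matching testing-samples-1..5.txt)."""
    tokens = sentence.split()
    if len(tokens) < max_len:
        return None  # sentence too short to yield a length-max_len sequence
    end = rng.randrange(max_len, len(tokens) + 1)
    return {n: tokens[end - n:end] for n in range(1, max_len + 1)}
```

With this construction, the sequence of length n is always the tail of the sequence of length n+1, which reproduces the containment property noted above.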
Sparse data experiment
As described in our paper, we ran experiments on successively smaller samples of the English Wikipedia corpus. These samples are created from wiki-en.tar.bz2 and can be downloaded here:
The above files contain the training text in training.txt as well as about 100k testing sequences of length 1 to 5 in the files testing-samples-1.txt ... testing-samples-5.txt.
Unseen training sequences
There are a number of Python scripts that split the test data into sequences seen and unseen in the training data. I still have to clean them up and will publish them here soon. Based on the software you should be able to do this yourself.