Title: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages
Authors: Goldhahn, Dirk/Eckart, Thomas/Quasthoff, Uwe
Year: 2012
Type: Article
Publisher: European Language Resources Association (ELRA)
Place: Istanbul
In: Calzolari, Nicoletta/Choukri, Khalid/Declerck, Thierry/Doğan, Mehmet U./Maegaard, Bente/Mariani, Joseph/Odijk, Jan/Piperidis, Stelios (eds.): Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, 23-25 May 2012
Pages: 759-765
Keywords: database
monolingual lexicography
internet lexicography/online lexicography
collocation analysis
corpus-based lexicography
Medium: Online
URI: http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf
Last accessed: 17 September 2018
Abstract: The Leipzig Corpora Collection offers free online access to 136 monolingual dictionaries enriched with statistical information. In this paper we describe current advances of the project in collecting and processing text data automatically for a large number of languages. Our main interest lies in "low-density" languages, for which little text data exists online. The aim of this approach is to create monolingual dictionaries and statistical information for a large number of new languages and to expand the existing dictionaries, opening up new possibilities for linguistic typology and other research. The paper focuses on the infrastructure for the automatic acquisition of large amounts of monolingual text in many languages from various sources. Preliminary results of the text data collection are presented. The largely language-independent framework for preprocessing, cleaning, and creating the corpora and for computing the necessary statistics is also described.