Title: Building Linguistic Corpora from Wikipedia Articles and Discussions
Authors: Margaretha, Eliza / Lüngen, Harald
Year: 2014
Type: Article
Journal: Journal for Language Technology and Computational Linguistics
Pages: 59-82
Volume: 29
Issue: 2
Languages studied: German
Keywords: corpus-based lexicography
user contribution
lexicographic editor
TEI
XML/SGML
Medium: Online
URI: https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/3330
Last accessed: 19.10.2020
Abstract: Wikipedia is a valuable resource, useful as a linguistic corpus or a dataset for many kinds of research. We built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus - DeReKo). Our approach is a two-stage conversion that combines parsing with the Sweble parser and transformation with XSLT stylesheets. The conversion generates rich and valid corpora regardless of language. We also introduce a method for segmenting user contributions on talk pages into postings.