Abstract: |
Large corpora are of increasing interest for lexicography. If a large corpus is to serve several lexicography projects, its quality is crucial. The corpus pre-processing pipeline used in the corpus project "Deutscher Wortschatz" is discussed in detail. The resulting full-form dictionary also contains statistical information such as word frequencies and word co-occurrences. Present and forthcoming usage scenarios for manual and automatic look-up are presented. When separate corpora are available for different text genres or time spans, a joint look-up across them reveals variations in word usage. From the lexicographer's point of view, the statistical data can serve as raw material for several kinds of dictionaries, including thesauri, collocation dictionaries, phraseological dictionaries and, of course, frequency dictionaries. |