Title: |
Large web corpora for Indian languages |
Authors: | Kilgarriff, Adam/Duvuru, Girish |
Year: |
2011 |
Type: |
Article |
Publisher: |
Springer |
Place of publication: |
Heidelberg/Berlin |
In: |
Singh, Chandan/Singh Lehal, Gurpreet/Sengupta, Jyotsna/Veer Sharma, Dharam/Goyal, Vishal (eds.): Information Systems for Indian Languages: Proceedings of the International Conference, ICISIL 2011, Patiala, India, 9-11 March 2011 |
Languages studied: |
English - Indian languages |
Keywords: |
frequency
internet lexicography/online lexicography
collocations/phraseologisms/multi-word items
corpus-based lexicography
|
URI: |
https://www.sketchengine.eu/wp-content/uploads/Large_Web_Corpora_2011.pdf |
Last accessed: |
10 September 2018 |
Abstract: |
For many languages, no large general-language corpora are available. Before the web, all but the richest institutions could do little but shake their heads in dismay, because corpus building was long, slow and expensive. With the advent of the web, it can be highly automated and thereby fast and inexpensive. In this demo we describe the 'corpus factory' method we use for collecting large web corpora for Indian and other languages. We have recently collected corpora for Hindi, Telugu, Kannada, Urdu, Gujarati, Tamil, Malayalam and Bengali. We also describe the Sketch Engine, a corpus tool that offers many language-analysis functions, and CQL, the advanced query language used by this system. |