Title: Large web corpora for Indian languages
Authors: Kilgarriff, Adam/Duvuru, Girish
Year: 2011
Type: Article
Publisher: Springer
Place: Heidelberg/Berlin
In: Singh, Chandan/Singh Lehal, Gurpreet/Sengupta, Jyotsna/Veer Sharma, Dharam/Goyal, Vishal (eds.): Information Systems for Indian Languages: Proceedings of the International Conference, ICISIL 2011, Patiala, India, 9 - 11 March 2011
Languages studied: English - Indian languages
Keywords: frequency
internet lexicography/online lexicography
collocations/phraseologisms/multi-word items
corpus-based lexicography
URI: https://www.sketchengine.eu/wp-content/uploads/Large_Web_Corpora_2011.pdf
Last accessed: 10.09.2018
Abstract: For many languages there are no large, general-language corpora available. Before the web, all but the richest institutions could do little but shake their heads in dismay, as corpus-building was long, slow and expensive. With the advent of the web, it can be highly automated and thereby fast and inexpensive. In this demo we describe the 'corpus factory' method we use for collecting large web corpora for Indian and other languages. We have recently collected corpora for Hindi, Telugu, Kannada, Urdu, Gujarati, Tamil, Malayalam and Bengali. We also describe the Sketch Engine, a corpus tool that offers many language analysis functions, and CQL, the advanced query language used by this system.