libTextCat: List of languages

The language guesser recognizes the following languages. Every combination of language and encoding has its own fingerprint.
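To illustrate why each language/encoding pair needs its own fingerprint, here is a minimal sketch. It assumes a fingerprint is a ranked list of the most frequent byte n-grams in a training text; the helper name `byte_ngram_profile`, its parameters, and the sample text are hypothetical and are not taken from libTextCat's source.

```python
from collections import Counter

def byte_ngram_profile(data: bytes, max_n: int = 3, top: int = 10) -> list:
    """Return the `top` most frequent byte n-grams (lengths 1..max_n) of `data`."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    # Sort by descending count, then by the n-gram bytes for a deterministic order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

# The same (hypothetical) Russian sample, stored in two different encodings.
text = "привет мир, привет всем"
koi8 = byte_ngram_profile(text.encode("koi8_r"))
cp1251 = byte_ngram_profile(text.encode("cp1251"))

# Cyrillic letters map to different byte values in each encoding,
# so the two byte-level profiles are not interchangeable.
print(koi8 != cp1251)  # prints True
```

Because the profiles are built from raw bytes, a fingerprint trained on koi8-r text would match windows1251 text poorly even though the language is the same.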

Language            Encoding(s)                     Notes
Afrikaans
Albanian
Amharic             utf
Arabic              iso8859-6, windows1256
Armenian
Basque
Belarusian          windows1251
Bosnian
Breton
Bulgarian           iso8859-5
Catalan
Chinese (Mandarin)  big5, gb2312
Croatian            ascii
Czech               iso8859-2
Danish
Drents                                              A Dutch dialect
Dutch
English
Esperanto
Estonian
Finnish
French
Frisian
Georgian
German
Greek               iso8859-7
Hebrew              iso8859-8
Hindi
Hungarian
Icelandic
Indonesian
Irish
Italian
Japanese            euc-jp, shift-jis
Korean
Latin
Latvian
Lithuanian
Malay
Manx
Marathi
Middle Frisian
Mingo                                               Native American language
Nepali
Norwegian
Persian
Polish
Portuguese
Quechua                                             Indigenous language of the Andean region
Romanian
Rumantsch                                           Swiss language
Russian             iso8859-5, koi8-r, windows1251
Sanskrit
Scots
Scots Gaelic
Serbian             ascii
Slovak              ascii, windows1250
Slovenian           ascii, iso8859-2
Spanish
Swahili
Swedish
Tagalog
Tamil
Thai
Turkish
Ukrainian           koi8-r
Vietnamese
Welsh
Yiddish             utf

Additions or corrections?

We strive to make this collection of fingerprints more comprehensive and more accurate. If you have a significant amount of text in a language/encoding combination that is missing from this list, or if you spot an error, please contact us. Our email address is libtextcat AT wise DASH guys.nl

Some open questions:

Our main focus will be on compiling a list of fingerprints of UTF-8 encoded languages, since Unicode is clearly the way to go and UTF-8 is usually the best way to do Unicode. (Moreover, providing the alternate Unicode encodings will be easy once we have the UTF-8 encodings.)
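As a sketch of why the alternate Unicode encodings follow easily from UTF-8 material: a UTF-8 training sample can be decoded once and re-encoded losslessly with standard codecs. The helper `transcode_sample`, its default target list, and the sample text are hypothetical and are not part of libTextCat.

```python
def transcode_sample(utf8_bytes: bytes,
                     targets=("utf-16-le", "utf-16-be", "utf-32-le")) -> dict:
    """Re-encode a UTF-8 sample into other Unicode encodings."""
    # Raises UnicodeDecodeError if the input is not valid UTF-8.
    text = utf8_bytes.decode("utf-8")
    return {enc: text.encode(enc) for enc in targets}

# Hypothetical training text, stored as UTF-8.
sample = "žluťoučký kůň".encode("utf-8")
variants = transcode_sample(sample)

for enc, data in variants.items():
    # Round-trip check: every variant decodes back to the same text.
    assert data.decode(enc) == sample.decode("utf-8")
    print(enc, len(data), "bytes")
```

Each re-encoded variant could then serve as training text for a separate fingerprint of the same language.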

© 2003 WiseGuys Internet