The language guesser recognizes the following languages. Every combination of language and encoding has its own fingerprint.
Language | Encoding(s) | Notes |
---|---|---|
Afrikaans | ||
Albanian | ||
Amharic | utf | |
Arabic | iso8859-6, windows1256 | |
Armenian | ||
Basque | ||
Belarusian | windows1251 | |
Bosnian | ||
Breton | ||
Bulgarian | iso8859-5 | |
Catalan | ||
Chinese (Mandarin) | big5, gb2312 | |
Croatian | ascii | |
Czech | iso8859-2 | |
Danish | ||
Drents | | A Dutch dialect |
Dutch | ||
English | ||
Esperanto | ||
Estonian | ||
Finnish | ||
French | ||
Frisian | ||
Georgian | ||
German | ||
Greek | iso8859-7 | |
Hebrew | iso8859-8 | |
Hindi | ||
Hungarian | ||
Icelandic | ||
Indonesian | ||
Irish | ||
Italian | ||
Japanese | euc-jp, shift-jis | |
Korean | ||
Latin | ||
Latvian | ||
Lithuanian | ||
Malay | ||
Manx | ||
Marathi | ||
Middle Frisian | ||
Mingo | | Native American language |
Nepali | ||
Norwegian | ||
Persian | ||
Polish | ||
Portuguese | ||
Quechua | | Indigenous language of the Andean region |
Romanian | ||
Rumantsch | | Swiss language |
Russian | iso8859-5, koi8-r, windows1251 | |
Sanskrit | ||
Scots | ||
Scots Gaelic | ||
Serbian | ascii | |
Slovak | ascii, windows1250 | |
Slovenian | ascii, iso8859-2 | |
Spanish | ||
Swahili | ||
Swedish | ||
Tagalog | ||
Tamil | ||
Thai | ||
Turkish | ||
Ukrainian | koi8-r | |
Vietnamese | ||
Welsh | ||
Yiddish | utf | |
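
To illustrate why each language/encoding combination needs its own fingerprint, here is a minimal Python sketch. It assumes a fingerprint is a ranked list of frequent byte n-grams, which is a common approach to text categorization but is not confirmed by this page as the exact format used here. The function name `byte_ngram_profile` and the sample sentence are hypothetical and are not part of the library.

```python
from collections import Counter


def byte_ngram_profile(data: bytes, max_n: int = 3, top: int = 10) -> list:
    """Return the `top` most frequent byte n-grams (lengths 1..max_n).

    Hypothetical helper for illustration only; not the library's own code.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    # Sort by descending count, then by the bytes themselves so the
    # ranking is deterministic when counts tie.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


# The same Russian sentence, stored in the three encodings listed above.
sample = "Привет, мир! Это пример русского текста."
profiles = {
    enc: byte_ngram_profile(sample.encode(enc))
    for enc in ("koi8_r", "cp1251", "iso8859_5")
}

for enc, profile in profiles.items():
    print(enc, profile[:5])
```

The three profiles differ because each encoding maps the Cyrillic letters to different byte values, so a profile built from one encoding would not match the same text stored in another.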
We strive to make this collection of fingerprints more comprehensive and more accurate. If you have a significant amount of text in a language/encoding combination that is missing from this list, or if you spot errors, please contact us. Our email address is libtextcat AT wise DASH guys.nl.
Some open questions:
Our main focus will be on compiling a list of fingerprints for UTF-8 encoded text, since Unicode is clearly the way to go and UTF-8 is usually the best way to do Unicode. (Moreover, providing fingerprints for the alternate Unicode encodings will be easy once we have the UTF-8 ones.)
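
As a rough sketch of why the alternate Unicode encodings should be easy to derive, the Python snippet below transcodes a UTF-8 sample into UTF-16 and UTF-32 using only the standard codecs. The helper name `transcode_utf8` and the Amharic sample are hypothetical. Only the transcoding step is shown; a fingerprint would still have to be regenerated from the transcoded text with whatever tool builds the fingerprints.

```python
def transcode_utf8(data: bytes, target: str) -> bytes:
    """Decode UTF-8 bytes and re-encode them in another Unicode encoding.

    Hypothetical helper for illustration only.
    """
    return data.decode("utf-8").encode(target)


# A short Amharic sample, stored as UTF-8 training text would be.
text = "ሰላም ዓለም"
utf8_sample = text.encode("utf-8")

variants = {
    enc: transcode_utf8(utf8_sample, enc)
    for enc in ("utf-16-le", "utf-16-be", "utf-32-le")
}

for enc, data in variants.items():
    # Each variant holds the same characters in a different byte layout.
    print(enc, len(data), "bytes")
```

Because the conversion is lossless in both directions, one UTF-8 corpus per language is enough as a starting point for the other Unicode encodings.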
© 2003 WiseGuys Internet