The language guesser recognizes the following languages. Every combination of language and encoding has its own fingerprint.
| Language | Encoding(s) | Notes |
|---|---|---|
| Afrikaans | ||
| Albanian | ||
| Amharic | utf | |
| Arabic | iso8859-6, windows1256 | |
| Armenian | ||
| Basque | ||
| Belarusian | windows1251 | |
| Bosnian | ||
| Breton | ||
| Bulgarian | iso8859-5 | |
| Catalan | ||
| Chinese (Mandarin) | big5, gb2312 | |
| Croatian | ascii | |
| Czech | iso8859-2 | |
| Danish | ||
| Drents | | A Dutch dialect |
| Dutch | ||
| English | ||
| Esperanto | ||
| Estonian | ||
| Finnish | ||
| French | ||
| Frisian | ||
| Georgian | ||
| German | ||
| Greek | iso8859-7 | |
| Hebrew | iso8859-8 | |
| Hindi | ||
| Hungarian | ||
| Icelandic | ||
| Indonesian | ||
| Irish | ||
| Italian | ||
| Japanese | euc-jp, shift-jis | |
| Korean | ||
| Latin | ||
| Latvian | ||
| Lithuanian | ||
| Malay | ||
| Manx | ||
| Marathi | ||
| Middle Frisian | ||
| Mingo | | Native American language |
| Nepali | ||
| Norwegian | ||
| Persian | ||
| Polish | ||
| Portuguese | ||
| Quechua | | Indigenous language of the Andean region |
| Romanian | ||
| Rumantsch | | Swiss language |
| Russian | iso8859-5, koi8-r, windows1251 | |
| Sanskrit | ||
| Scots | ||
| Scots Gaelic | ||
| Serbian | ascii | |
| Slovak | ascii, windows1250 | |
| Slovenian | ascii, iso8859-2 | |
| Spanish | ||
| Swahili | ||
| Swedish | ||
| Tagalog | ||
| Tamil | ||
| Thai | ||
| Turkish | ||
| Ukrainian | koi8-r | |
| Vietnamese | ||
| Welsh | ||
| Yiddish | utf | |
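To illustrate why every language/encoding combination needs its own fingerprint, here is a minimal sketch in Python. It assumes a simplified fingerprint made of the most frequent byte n-grams; the function name `byte_ngram_profile`, its parameters, and the sample text are hypothetical and are not taken from libtextcat itself. The point is only that the same Russian text produces different byte sequences, and therefore different profiles, in each encoding.

```python
from collections import Counter


def byte_ngram_profile(data: bytes, max_n: int = 3, top: int = 10) -> list[bytes]:
    """Return the `top` most frequent byte n-grams (lengths 1..max_n).

    Simplified, hypothetical stand-in for a fingerprint; not libtextcat's format.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    # Sort by descending count, then by byte value so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


sample = "Привет, мир"  # hypothetical sample text
profiles = {
    enc: byte_ngram_profile(sample.encode(enc))
    for enc in ("iso8859_5", "koi8_r", "cp1251")
}

for enc, profile in profiles.items():
    print(enc, profile[:3])
```

Because the profiles differ per encoding, a guesser that compares byte-level statistics has to be trained separately on each combination listed in the table.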
We strive to make this collection of fingerprints more comprehensive and more accurate. If you have a significant amount of text in a language/encoding combination that is missing from this list, or if you spot errors, please contact us. Our email address is libtextcat AT wise DASH guys.nl
Some open questions:
Our main focus will be on compiling a list of fingerprints of UTF-8-encoded languages, since Unicode is clearly the way to go and UTF-8 is usually the best way to do Unicode. (Moreover, providing the alternative Unicode encodings will be easy once we have the UTF-8 encodings.)
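As a rough illustration of why the other Unicode encodings follow easily from UTF-8 material, the sketch below re-encodes UTF-8 sample text into UTF-16 and UTF-32 using Python's standard codecs. The helper name `transcode` and the sample text are assumptions made for this example; how the re-encoded text is then turned into a fingerprint is not shown here.

```python
def transcode(data: bytes, source: str = "utf-8", target: str = "utf-16-le") -> bytes:
    """Decode `data` from `source` and re-encode it as `target`."""
    return data.decode(source).encode(target)


text = "Ελληνικά"  # hypothetical sample text
utf8_sample = text.encode("utf-8")

# The same text, ready to be used as input for other Unicode encodings.
utf16_sample = transcode(utf8_sample, target="utf-16-le")
utf32_sample = transcode(utf8_sample, target="utf-32-be")
```

The conversion is lossless in both directions, so one UTF-8 text collection per language would be enough to derive input for the other Unicode encodings.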
© 2003 WiseGuys Internet