:: libTextCat :: Documentation



Do the familiar dance:

This will install the library in /usr/local/lib/ and the createfp binary in /usr/local/bin.

The library is known to compile flawlessly on the following platforms:

If you manage to get it working on other systems, please drop us a note, and we'll proudly add you to this page.

Quickstart: language guesser

Assuming that you have successfully compiled the library, you still need some language models to start guessing languages. If you don't feel like creating them yourself (cf. Creating your own fingerprints below), you can use the excellent collection of over 70 language models provided in Gertjan van Noord's "TextCat" package. You can find these models and a matching configuration file in the langclass directory:

Paste some text onto the command line, and watch it get classified.

Using the API

Classifying the language of a text buffer can be as easy as:

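#include <stdio.h>
#include "textcat.h"
/* buffer holds the text to classify; only its first 400 bytes are examined */
void *h = textcat_Init( "conf.txt" );
printf( "Language: %s\n", textcat_Classify(h, buffer, 400) );
textcat_Done(h);  /* release the handle */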

Creating your own fingerprints

The createfp program allows you to easily create your own document fingerprints. Just feed it an example document on standard input, and store the standard output:

% createfp < mydocument.txt > myfingerprint.txt

Put the names of your fingerprints in a configuration file, add some IDs, and you're ready to classify.
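As a rough illustration, a configuration file could look like the fragment below. It assumes one model per line, with the fingerprint file name followed by whitespace and the ID to report. The file names and IDs are hypothetical, so check the configuration file shipped in the langclass directory for the authoritative layout.

```text
myfingerprint.txt       mylanguage
otherfingerprint.txt    otherlanguage
```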

A word on character encodings

The library assumes very little about encodings. A couple of caveats though:

For our next release, we will strive to make the library completely agnostic as to which encoding is used.

Performance tuning

This library was made with efficiency in mind. There are a couple of parameters you may wish to tweak if you intend to use it for tasks other than language guessing.

The most important thing is buffer size. For reliable language guessing the classifier needs only a few kilobytes at most, so don't feed it 100KB of text unless you are creating a fingerprint.
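One way to follow this advice is to cap the length you pass to the classifier, whatever the size of the document. The sketch below is only an illustration: the helper name classify_length and the 4096-byte cap are assumptions, not part of the library.

```c
#include <stddef.h>

/* Hypothetical cap: the text above only says "a few kilobytes". */
#define MAX_CLASSIFY_BYTES 4096

/* Return how many bytes of a text_len-byte buffer to hand to the classifier. */
size_t classify_length(size_t text_len)
{
    return text_len < MAX_CLASSIFY_BYTES ? text_len : MAX_CLASSIFY_BYTES;
}
```

With a helper like this, a call would read textcat_Classify(h, buffer, classify_length(len)), where len is the full length of the text.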

If you insist on feeding the classifier lots of text, try adjusting TABLEPOW, which determines the size of the hash table used to store the n-grams. Making it too small results in many hash collisions; making it too large causes erratic memory behaviour. Both hurt performance.
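To get a feel for the trade-off, the following sketch counts how many slots the n-grams of a text occupy in a table of 1 << tablepow slots. It is not the library's implementation: the FNV-1a hash, the function names, and the assumption that the table size is a power of two derived from the parameter are all illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* FNV-1a hash over the bytes of one n-gram (an illustrative choice). */
static uint32_t hash_ngram(const char *s, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Count the distinct slots hit by all n-grams of text in a table of
 * (1 << tablepow) slots. Returns 0 on bad arguments or allocation failure. */
size_t slots_used(const char *text, size_t len, size_t n, unsigned tablepow)
{
    if (n == 0 || len < n || tablepow > 24)
        return 0;

    size_t size = (size_t)1 << tablepow;
    unsigned char *used = calloc(size, 1);   /* one flag byte per slot */
    if (used == NULL)
        return 0;

    size_t count = 0;
    for (size_t i = 0; i + n <= len; i++) {
        /* Masking with size - 1 works because size is a power of two. */
        uint32_t slot = hash_ngram(text + i, n) & (uint32_t)(size - 1);
        if (!used[slot]) {
            used[slot] = 1;
            count++;
        }
    }
    free(used);
    return count;
}
```

A small tablepow forces many different n-grams to share slots, while a large one allocates a table that is mostly empty, so comparing slots_used for a few values on representative input gives a first hint about a sensible setting.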

Putting the most probable models at the top of the list in your config file improves performance, because this will raise the threshold for likely candidates more quickly.
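The effect of model order can be sketched as follows, assuming the classifier stops scoring a model as soon as its running score is worse than the best score seen so far. That stopping rule, the per-n-gram penalty arrays, and the function names are assumptions made for illustration; the library's actual scoring may differ.

```c
#include <limits.h>
#include <stddef.h>

/* Sum the penalties of one model, giving up once the running total
 * exceeds cutoff. *steps is incremented for every term examined. */
static long bounded_score(const int *penalty, size_t n, long cutoff,
                          size_t *steps)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += penalty[i];
        (*steps)++;
        if (sum > cutoff)
            return LONG_MAX;   /* cannot beat the current best */
    }
    return sum;
}

/* Scan the models in the given order and return the index of the one
 * with the lowest score. *steps receives the total number of terms examined. */
size_t best_model(const int *const *models, size_t nmodels, size_t n,
                  size_t *steps)
{
    long best = LONG_MAX;
    size_t best_idx = 0;
    *steps = 0;
    for (size_t m = 0; m < nmodels; m++) {
        long s = bounded_score(models[m], n, best, steps);
        if (s < best) {
            best = s;
            best_idx = m;
        }
    }
    return best_idx;
}
```

With one well-matching model (penalties 1, 1, 1, 1) and one poor match (penalties 10, 10, 10, 10), listing the good model first examines 5 terms in total, while listing it last examines 8, because the poor match is only abandoned early once a good score is already known.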

Since the speed of the classifier is roughly linear with respect to the number of models, you should consider how many models you really need. In case of language guessing: do you really want to recognize every language ever invented?

© 2003 WiseGuys Internet B.V.