lucene-java-user mailing list archives

From Bob Carpenter <c...@alias-i.com>
Subject Re: Language detection library
Date Mon, 07 May 2007 17:35:58 GMT

>> Anyone knows of a good language detection library that can detect what
>> language a document (text) is ?

Language detection is easy.  It's just a simple
text classification problem.

One way you can do this is using Lucene
itself.  Create a so-called pseudo-document
for each language consisting of lots of text
(1 MB or more, ideally).  Then build a Lucene
index using a character n-gram tokenizer.
E.g., "John Smith" tokenizes to "Jo", "oh",
"hn", "n ", " S", "Sm", "mi", "it", "th"
with 2-grams.
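
As a rough illustration of that tokenization (plain Python, not
Lucene's tokenizer API; the function name char_ngrams is made up
for this sketch), overlapping character n-grams can be produced
like so:

```python
def char_ngrams(text, n):
    """Return every overlapping character n-gram of text, spaces included."""
    # Slide a window of width n over the string one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    # Reproduces the "John Smith" 2-gram example from the text above.
    print(char_ngrams("John Smith", 2))
```

Note that the space is kept as an ordinary character, so word
boundaries show up in the grams "n " and " S".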

You'll have to make sure to index beyond
Lucene's default per-field limit, which is the
first 10,000 tokens (IndexWriter's maxFieldLength).

To do language ID, just treat the text
to be identified as the basis of a query.
Parse it using the same character n-gram
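tokenizer.  The highest-scoring result is
the answer; if two languages score close to
each other, you know there may be some ambiguity.
You can't trust Lucene's normalized scoring for
rejection, though: the scores aren't comparable
across queries, so a fixed threshold won't
reliably flag text in none of the indexed languages.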
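
To make the query step concrete, here is a small stand-in for
the Lucene setup, written in plain Python rather than against
Lucene itself.  The weighting (square-root term frequency,
squared IDF, length normalization) only loosely imitates
Lucene's classic scoring and is an assumption of this sketch, as
are the names build_index and rank_languages and the tiny toy
training strings (real pseudo-documents should be far larger):

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Overlapping character n-grams, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(pseudo_docs, n=2):
    """Return per-language n-gram counts and an IDF weight per n-gram."""
    counts = {lang: Counter(char_ngrams(text, n))
              for lang, text in pseudo_docs.items()}
    num_docs = len(counts)
    # Document frequency: in how many pseudo-documents each n-gram occurs.
    doc_freq = Counter(g for c in counts.values() for g in c)
    idf = {g: 1.0 + math.log(num_docs / (df + 1.0))
           for g, df in doc_freq.items()}
    return counts, idf


def rank_languages(query, counts, idf, n=2):
    """Score every language against the query text, best first."""
    query_grams = Counter(char_ngrams(query, n))
    scores = {}
    for lang, c in counts.items():
        norm = math.sqrt(sum(c.values()))  # crude length normalization
        raw = sum(qtf * math.sqrt(c[g]) * idf[g] ** 2
                  for g, qtf in query_grams.items() if g in c)
        scores[lang] = raw / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy pseudo-documents, invented for this sketch.
PSEUDO_DOCS = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "german": "der schnelle braune fuchs springt ueber den faulen hund und die katze",
}

if __name__ == "__main__":
    counts, idf = build_index(PSEUDO_DOCS)
    for lang, score in rank_languages("the dog", counts, idf):
        print(lang, round(score, 3))
```

Looking at the gap between the first and second entries of the
ranked list is one simple way to spot the ambiguous cases
mentioned above.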

Make sure the tokenizer includes spaces as
well as non-space characters (though runs of
whitespace may be normalized to a single space).
Using more orders (1-grams, 2-grams, 3-grams,
etc.) gives more accuracy; the IDF weighting
is quite sensible here and will work out the
details for the counts for you.
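
A minimal sketch of both points, whitespace normalization and
mixing several n-gram orders, again in plain Python rather than
as a Lucene tokenizer (the name multi_order_ngrams and the
default orders of 1 through 3 are just choices for this example):

```python
import re


def multi_order_ngrams(text, orders=(1, 2, 3)):
    """Collapse whitespace runs, then emit n-grams for each order."""
    # Any run of spaces, tabs, or newlines becomes a single space.
    normalized = re.sub(r"\s+", " ", text)
    grams = []
    for n in orders:
        grams.extend(normalized[i:i + n]
                     for i in range(len(normalized) - n + 1))
    return grams


if __name__ == "__main__":
    print(multi_order_ngrams("John   Smith", orders=(1, 2)))
```

All orders go into the same token stream, so the term weighting
decides how much each gram length contributes.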

For a more sophisticated approach, check out
LingPipe's language ID tutorial, which is
based on probabilistic character language models.
Think of it as similar to the Lucene model but
with different term weighting.

    http://www.alias-i.com/lingpipe/demos/tutorial/langid/read-me.html
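
To give a feel for the language-model idea, here is a toy
version: one character bigram model per language with add-one
smoothing, picking the language whose model assigns the text the
highest log probability.  This is not LingPipe's implementation
(which uses its own smoothing and longer n-grams); the class
name, the alphabet size of 256, and the toy training strings are
all assumptions of the sketch:

```python
import math
from collections import Counter


class CharBigramModel:
    """Character bigram model with add-one (Laplace) smoothing."""

    def __init__(self, text, alphabet_size=256):
        # Counts of (previous char, current char) pairs in the training text.
        self.pair_counts = Counter(zip(text, text[1:]))
        # Counts of each char appearing as a context (every char but the last).
        self.context_counts = Counter(text[:-1])
        self.alphabet_size = alphabet_size  # assumed size, for smoothing only

    def log_prob(self, text):
        """Sum of log P(current char | previous char) over the text."""
        total = 0.0
        for prev, cur in zip(text, text[1:]):
            numerator = self.pair_counts[(prev, cur)] + 1.0
            denominator = self.context_counts[prev] + self.alphabet_size
            total += math.log(numerator / denominator)
        return total


def classify(text, models):
    """Return the language whose model gives text the highest log probability."""
    return max(models, key=lambda lang: models[lang].log_prob(text))


# Toy training data, invented for this sketch.
MODELS = {
    "english": CharBigramModel(
        "the quick brown fox jumps over the lazy dog and the cat"),
    "german": CharBigramModel(
        "der schnelle braune fuchs springt ueber den faulen hund und die katze"),
}

if __name__ == "__main__":
    print(classify("the dog", MODELS))
```

Unlike the TF/IDF scores, the per-language log probabilities are
estimates on a common scale, which is part of what makes the
probabilistic approach appealing.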

Here's accuracy vs. input length on a set of 15
languages from the Leipzig Corpus collection (just
one of the many evals in the tutorial):

#chars  accuracy
1	22.59%
2	34.82%
4	58.55%
8	81.17%
16	92.45%
32	97.33%
64	98.99%
128	99.67%

The end of the tutorial has references to other
popular language ID packages online (e.g. TextCat,
which is Gertjan van Noord's Perl package).  And it
also has references to the technical background
on TF/IDF classification with n-grams and
character language models.

- Bob Carpenter
   Alias-i
