lucene-java-user mailing list archives

From Bob Carpenter
Subject Re: Analyzers and multiple languages (language detection)
Date Tue, 21 Nov 2006 22:35:56 GMT
Antony Bowesman wrote:
> Hello,
> I'm new to Lucene and wanted some advice on analyzers, stemmers and 
> language analysis.  I've got LIA, so have read its chapters.
> I am writing a framework that needs to be able to index documents from a 
> range of languages where just the character set of the document is 
> known.  Has anyone looked at or is using language analysis to determine 
> the language of a document in ISO-8859-1?

Language ID is pretty easy.  The best way to
do it wholly within Lucene would be with a
separate index containing one document per
language, with an analyzer that returned weighted
character n-grams.  You can read about our analyzer
that does that in LIA.  This is what some packages,
such as Gertjan van Noord's, do.
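
To make the n-gram idea concrete, here is a minimal stand-alone
sketch in Python.  It is not Lucene code and not the analyzer
mentioned above; it only imitates the "one document per language"
setup by keeping one character n-gram count profile per language
and ranking the profiles by cosine similarity against the input.
The sample sentences, the function names and the choice of 2- and
3-grams are all assumptions made for illustration.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=2, n_max=3):
    """Count character n-grams, with a space marking each word edge."""
    padded = " " + " ".join(text.lower().split()) + " "
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy training text (hypothetical); a real profile needs far more data.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog "
          "and this is the language of the document",
    "de": "der schnelle braune fuchs springt durch den wald "
          "und das ist die sprache des dokuments",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux "
          "et c'est la langue du document",
}

# One profile per language, playing the role of one indexed document each.
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}

def identify(text):
    """Return (best_language, scores) for the given text."""
    query = char_ngrams(text)
    scores = {lang: cosine(query, prof) for lang, prof in PROFILES.items()}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    best, scores = identify("das ist die sprache")
    print(best, scores)
```

In an actual Lucene setup the index's own scoring would replace the
cosine function, and the profiles would be built from much larger
samples of each language.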

If you need very high accuracy, you could also
use our language ID, which is based on a probabilistic
classifier.  You can check out our tutorial at:

Accuracy depends on the pair of languages (some are
more confusable than others), as well as on the length of
the input (it's very hard with only one or two words,
especially if it's a name).
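
For the probabilistic route, one common approach (sketched here
under my own assumptions, and not a reproduction of the classifier
mentioned above) is to train a character n-gram language model per
language and pick the language whose model gives the input the
highest probability.  The class and function names, the add-one
smoothing, the uniform prior over languages and the toy training
text are all hypothetical choices for this example.  The alphabet
size of 256 assumes single-byte ISO-8859-1 input.

```python
import math
from collections import Counter

class CharNgramModel:
    """Character n-gram language model with add-one smoothing."""

    def __init__(self, text, n=3, alphabet_size=256):
        self.n = n
        self.alphabet_size = alphabet_size  # assumed: one byte per character
        padded = " " * (n - 1) + text.lower()
        grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
        self.ngrams = Counter(grams)
        self.contexts = Counter(gram[:-1] for gram in grams)

    def log_prob(self, text):
        """Log-probability of text, one character at a time given its context."""
        padded = " " * (self.n - 1) + text.lower()
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            numerator = self.ngrams[gram] + 1
            denominator = self.contexts[gram[:-1]] + self.alphabet_size
            total += math.log(numerator / denominator)
        return total

def classify(text, models):
    """Return (best_language, posteriors), assuming a uniform prior."""
    log_probs = {lang: model.log_prob(text) for lang, model in models.items()}
    top = max(log_probs.values())
    # Log-sum-exp keeps the normalization numerically stable.
    norm = top + math.log(sum(math.exp(lp - top) for lp in log_probs.values()))
    posteriors = {lang: math.exp(lp - norm) for lang, lp in log_probs.items()}
    return max(posteriors, key=posteriors.get), posteriors

# Toy training text (hypothetical); real models need far more data.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog "
          "and this is the language of the document",
    "de": "der schnelle braune fuchs springt durch den wald "
          "und das ist die sprache des dokuments",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux "
          "et c'est la langue du document",
}
MODELS = {lang: CharNgramModel(text) for lang, text in TRAINING.items()}

if __name__ == "__main__":
    print(classify("this is the language of the document", MODELS))
    print(classify("document", MODELS))
```

Each character adds one term to the log-probability, so longer
inputs give the models more evidence to separate on.  That is
consistent with the point about one- or two-word inputs being hard.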

- Bob Carpenter
