Message-ID: <45637F4C.4030808@alias-i.com>
Date: Tue, 21 Nov 2006 17:35:56 -0500
From: Bob Carpenter <carp@alias-i.com>
To: java-user@lucene.apache.org
Subject: Re: Analyzers and multiple languages (language detection)
References: <452F4381.5000106@teamware.com>
In-Reply-To: <452F4381.5000106@teamware.com>

Antony Bowesman wrote:
> Hello,
> 
> I'm new to Lucene and wanted some advice on analyzers, stemmers and 
> language analysis.  I've got LIA, so have read its chapters.
> 
> I am writing a framework that needs to be able to index documents from a 
> range of languages where just the character set of the document is 
> known.  Has anyone looked at or is using language analysis to determine 
> the language of a document in ISO-8859-1?

Language ID is pretty easy.  The best way to
do it wholly within Lucene would be with a
separate index containing one document per
language, with an analyzer that returned weighted
character n-grams.  You can read about our analyzer
to do that in LIA.  This is what some packages,
such as Gertjan van Noord's TextCat, do.
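
The profile-per-language idea can be sketched outside
Lucene in a few lines.  This is only an illustration, not
the analyzer described in LIA: the tiny training strings,
the trigram size, and the cosine scoring are all assumptions
made for the example, and the names char_ngrams, PROFILES
and rank_languages are hypothetical.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Count overlapping character n-grams, with a space padded on each end."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy training text (an assumption for this sketch); a real
# profile would be built from far more text per language.
SAMPLES = {
    "en": "the quick brown fox jumps into the wood and this is "
          "the house where the people live",
    "de": "der schnelle braune fuchs springt in den wald und das "
          "ist das haus in dem die leute wohnen",
    "fr": "le renard brun rapide saute dans le bois et voici la "
          "maison dans laquelle les gens habitent",
}

# One "document" per language, as in the separate-index approach.
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}

def rank_languages(text):
    """Return (language, score) pairs, best match first."""
    query = char_ngrams(text)
    scores = {lang: cosine(query, prof) for lang, prof in PROFILES.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling rank_languages("das haus ist braun") returns the
languages ordered by similarity; the top entry is the guess,
and the gap to the second entry gives a rough idea of how
safe the guess is.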

If you need very high accuracy, you could also
use our language ID, which is based on a probabilistic
classifier.  You can check out our tutorial at:

http://www.alias-i.com/lingpipe/demos/tutorial/langid/read-me.html
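
For readers who want a feel for what a probabilistic
classifier looks like here, the following is a minimal
sketch.  It is not LingPipe's API or implementation: the
class CharNGramModel, the add-one smoothing, the uniform
prior over languages, and the toy training strings are all
assumptions chosen to keep the example short.  The alphabet
size of 256 is picked to match the ISO-8859-1 setting in the
question.

```python
from collections import Counter
from math import log

class CharNGramModel:
    """Character n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, text, n=3, alphabet_size=256):
        self.n = n
        self.alphabet_size = alphabet_size
        padded = " " * (n - 1) + text.lower()
        spans = range(len(padded) - n + 1)
        # Counts of each n-gram and of its (n-1)-character context.
        self.ngrams = Counter(padded[i:i + n] for i in spans)
        self.contexts = Counter(padded[i:i + n - 1] for i in spans)

    def log_prob(self, text):
        """Log probability of text, one character at a time given its context."""
        padded = " " * (self.n - 1) + text.lower()
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            seen = self.ngrams[gram] + 1
            ctx = self.contexts[gram[:-1]] + self.alphabet_size
            total += log(seen / ctx)
        return total

def classify(text, models):
    """Pick the language whose model gives text the highest log probability
    (equivalent to the most probable language under a uniform prior)."""
    return max(models, key=lambda lang: models[lang].log_prob(text))

# Toy training data, an assumption for this sketch.
MODELS = {
    "en": CharNGramModel("the quick brown fox jumps into the wood and "
                         "this is the house where the people live"),
    "de": CharNGramModel("der schnelle braune fuchs springt in den wald "
                         "und das ist das haus in dem die leute wohnen"),
}
```

classify("das ist das haus", MODELS) picks the model that
assigns the input the higher probability.  Because the score
is a sum over characters, short inputs contribute few terms
and the difference between languages is correspondingly small.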

Accuracy depends on the pair of languages (some are
more confusable than others), as well as the length of
the input (it's very hard with only one or two words,
especially if it's a name).

- Bob Carpenter
   Alias-i
