lucene-java-user mailing list archives

From Antony Bowesman <...@teamware.com>
Subject Analyzers and multiple languages
Date Fri, 13 Oct 2006 07:42:57 GMT
Hello,

I'm new to Lucene and would like some advice on analyzers, stemmers and language
analysis.  I have LIA (Lucene in Action) and have read its chapters.

I am writing a framework that needs to index documents in a range of languages,
where only the character set of each document is known.  Has anyone looked at,
or is anyone using, language analysis to determine the language of a document
encoded in ISO-8859-1?
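
To make the question concrete, here is a minimal sketch of the kind of language
guessing I have in mind, assuming a simple stop word count is good enough.  It
uses only the JDK, not Lucene.  The class name, the method name and the tiny
stop word lists are all hypothetical; real lists would be far larger, and a
real detector might well use a different technique altogether.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: guesses a language code by counting stop word hits.
class StopWordLanguageGuesser {

    // Tiny illustrative stop word lists (assumptions, not taken from Lucene).
    private static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList(
                "the", "and", "of", "to", "is", "in", "that")));
        STOP_WORDS.put("de", new HashSet<>(Arrays.asList(
                "der", "die", "das", "und", "ist", "nicht", "ein")));
        STOP_WORDS.put("fr", new HashSet<>(Arrays.asList(
                "le", "la", "les", "et", "est", "des", "une")));
    }

    // Decodes the bytes as ISO-8859-1 and returns the language whose stop
    // words occur most often, or "unknown" if no stop word matches at all.
    static String guess(byte[] raw) {
        String text = new String(raw, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        // Split on anything that is not a letter.
        String[] tokens = text.split("[^\\p{L}]+");

        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOP_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            // Strictly greater: on a tie the earlier language wins.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

The returned code could then be used to choose a per-language analyzer and stop
word list at indexing time.  Short documents, or documents that mix languages,
would probably defeat something this naive.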

Is it worth doing, or does StandardAnalyzer cope well with most European
languages as long as it is given a suitable multilingual set of stop words?

What about stemming?  I see Google now says it does stemming, but here again
language detection seems to be the stumbling block in choosing the right
stemmer.  Does stemming reduce the index size by much, and is it actually
useful in search?
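
To show what I mean by a size reduction, here is a toy sketch of how stemming
collapses inflected forms into a single term.  The suffix stripper is a
deliberately naive stand-in that I made up for illustration, not the Porter
algorithm or any Lucene stemmer, and the class and method names are
hypothetical.  It only counts distinct terms in a list; it does not measure the
size of a real index.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo: compares distinct term counts with and without stemming.
class NaiveStemDemo {

    // Naive English-only suffix stripping (an assumption for illustration).
    // Strips the first matching suffix if at least 3 characters remain.
    static String stem(String term) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (term.endsWith(suffix) && term.length() - suffix.length() >= 3) {
                return term.substring(0, term.length() - suffix.length());
            }
        }
        return term;
    }

    // Counts the distinct terms, optionally stemming each one first.
    static int distinctTerms(List<String> terms, boolean stemmed) {
        Set<String> unique = new HashSet<>();
        for (String term : terms) {
            unique.add(stemmed ? stem(term) : term);
        }
        return unique.size();
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "index", "indexes", "indexed", "indexing",
                "search", "searches", "searching", "searched");
        System.out.println("raw:     " + distinctTerms(terms, false));
        System.out.println("stemmed: " + distinctTerms(terms, true));
    }
}
```

With this word list the eight raw terms collapse to two stems, "index" and
"search".  A query for "searching" would then also match documents containing
"searched", which is the search-side effect I am asking about.  Applying these
English suffixes to text in another language would give poor stems, hence the
dependence on language detection.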

Antony

