lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Analyzers and multiple languages
Date Fri, 13 Oct 2006 14:52:42 GMT

On Oct 13, 2006, at 3:42 AM, Antony Bowesman wrote:
> I am writing a framework that needs to be able to index documents  
> from a range of languages where just the character set of the  
> document is known.  Has anyone looked at, or is anyone using, language
> analysis to determine the language of a document in ISO-8859-1?

There is a language identifier plugin in the Nutch codebase that  
could surely be distilled (and there are plans to do so) into a  
standalone library:

	<http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/languageidentifier/>
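
For illustration only, here is a minimal sketch of one common approach to language identification: count the character trigrams in a document and compare them against a per-language profile. This is not the Nutch plugin's code. The class name NGramLanguageGuesser, the cosine scoring, and the idea of training from a short sample string are all assumptions made for the example, and a real profile would need far more training text.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: guesses a language by comparing character trigram counts. */
public class NGramLanguageGuesser {
    // language code -> trigram counts built from a training sample
    private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

    /** Counts character trigrams after lowercasing and collapsing non-letters to spaces. */
    public static Map<String, Integer> trigrams(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT)
                             .replaceAll("[^\\p{L}]+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Registers a language profile built from sample text (assumed representative). */
    public void train(String lang, String sample) {
        profiles.put(lang, trigrams(sample));
    }

    /** Returns the best-scoring language, or null if no profile shares any trigram. */
    public String guess(String text) {
        Map<String, Integer> doc = trigrams(text);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Integer>> p : profiles.entrySet()) {
            double score = cosine(doc, p.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = p.getKey();
            }
        }
        return best;
    }

    /** Cosine similarity between two sparse count vectors; 0 if either is empty. */
    private static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            nb += v * v;
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }
}
```

To use it, call train("en", ...) and train("de", ...) with sample text for each language, then pass the document text to guess(...). Short documents and closely related languages will be guessed less reliably, so treat the result as a hint rather than a fact.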


> What about stemming?  I see Google now says it does stemming, but  
> again here language detection seems to be a stumbling block in the  
> way of choosing the right stemmer.  Does stemming provide much of  
> an index size reduction and is it actually useful in search?

Stemming shouldn't be considered a way to reduce index size, but rather
a way to improve findability for users.  It is quite useful in
the right situations, but not every project wants it, so you'd have
to see whether it fits your needs specifically.
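
To make the findability point concrete, here is a toy sketch of what stemming buys you at search time: different surface forms of a word reduce to the same term, so a query for one form matches documents containing another. The ToyStemmer class and its four-suffix rule list are invented for this example. It is not the Porter or Snowball algorithm and not Lucene's API, and it only makes sense for English.

```java
import java.util.Locale;

/** Hypothetical toy stemmer: strips a few English suffixes. Not Porter, not Lucene. */
public class ToyStemmer {
    // Checked in order, so longer suffixes are tried before the bare "s".
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    /** Lowercases the word and removes the first matching suffix, keeping at least 3 letters. */
    public static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    /** True when the query term and the document term reduce to the same stem. */
    public static boolean matches(String queryTerm, String docTerm) {
        return stem(queryTerm).equals(stem(docTerm));
    }
}
```

With these rules, "indexing", "indexed" and "indexes" all reduce to "index", so matches("indexes", "indexed") is true. The same rules miss irregular forms ("indices" becomes "indic") and would mangle words from other languages, which is why the stemmer has to be chosen to match the language of the document.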

	Erik


