lucene-java-user mailing list archives

From "Erick Erickson" <>
Subject Re: Analyzers and multiple languages
Date Fri, 13 Oct 2006 12:22:53 GMT
This won't be *really* helpful, but I remember this being discussed at some
length a while ago. You should find some good info if you search the list
archive, probably for "language".

I didn't pay much attention since this isn't something I'm concerned with
lately, so I can't be much real help...


On 10/13/06, Antony Bowesman <> wrote:
> Hello,
> I'm new to Lucene and wanted some advice on analyzers, stemmers and
> language analysis.  I've got LIA, so have read its chapters.
> I am writing a framework that needs to be able to index documents from a
> range of languages where just the character set of the document is known.
> Has anyone looked at, or is anyone using, language analysis to determine
> the language of a document in ISO-8859-1?
> Is it worth doing, or does StandardAnalyzer cope well with most European
> languages as long as it is provided with a suitable multi-lingual set of
> stop words?
> What about stemming?  I see Google now says it does stemming, but again
> here language detection seems to be a stumbling block in the way of
> choosing the right stemmer.  Does stemming provide much of an index size
> reduction and is it actually useful in search?
> Antony
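
For the language-detection question in the quoted message, here is a minimal sketch of one simple approach: count how many tokens of a document appear in each language's stop-word list and pick the language with the most hits. This is only an illustration, not something taken from the thread or from Lucene itself. The class name StopWordLanguageGuesser, the guess method, and the tiny stop-word lists are all hypothetical; real lists would need to be much larger, and short or mixed-language documents will often be misclassified.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: guesses a document's language from stop-word hits.
final class StopWordLanguageGuesser {

    // Tiny illustrative stop-word lists (assumed, not from any library).
    private static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();

    static {
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("the", "and", "of", "is", "to", "in")));
        STOP_WORDS.put("de", new HashSet<>(Arrays.asList("der", "die", "das", "und", "ist", "nicht")));
        STOP_WORDS.put("fr", new HashSet<>(Arrays.asList("le", "la", "les", "et", "est", "des")));
    }

    private StopWordLanguageGuesser() {
    }

    /**
     * Returns the code of the language whose stop words occur most often
     * in the text, or "unknown" if no stop word from any list is found.
     * On a tie, the language registered first wins.
     */
    static String guess(String text) {
        // Lower-case and split on anything that is not a Unicode letter.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOP_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

The returned code could then be used to choose a language-specific analyzer or stemmer at index time, with a language-neutral analyzer as the fallback when the guess is "unknown".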
