lucene-java-user mailing list archives

From "Erick Erickson" <>
Subject Re: Automatic analyzer resolving based on Locale
Date Wed, 09 May 2007 17:59:20 GMT
Well, I don't see how this can work. In your example, you'd
index "werkte". But how are you going to search so that
this matches "werk", no matter what analyzer you use? It looks
like you're thinking about either stemming or wildcarding, but
stemming is language-dependent, so I don't think you're going to
get reasonable results if you try to stem an index that mixes
languages.

You might try indexing AND searching in a single normalized form:
say, the 7-bit ASCII character set, with all the accented
characters folded into their unaccented forms. Of course, the data
you search on doesn't have to be the data you display, so you
could still display the data in the native language.
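
To make the folding idea concrete, here is a minimal sketch in plain Java.
It is not a Lucene analyzer; the class name AsciiFolding and the method
fold are made up for illustration, and it only uses java.text.Normalizer
from the JDK. The same method would have to be applied to both the indexed
text and the query text.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
final class AsciiFolding {
    // Decompose accented characters (NFD), drop the combining marks,
    // then lowercase in a locale-independent way.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only.
        System.out.println(fold("Cr\u00e8me Br\u00fbl\u00e9e")); // creme brulee
        System.out.println(fold("ge\u00efnteresseerd"));         // geinteresseerd
    }
}
```

Characters that have no canonical decomposition (for example "ø" or "ß")
pass through unchanged, so a real setup would need an explicit mapping
table for those.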

Mind you, I think this is tricky, with hidden problems. But it
might work.

As far as the stemming is concerned, I suppose you could also
use the native stemmer at index time, *then* fold all the characters
into 7-bit ASCII, THEN index the result, and do the same at query
time. I *still* think you'd get some "interesting" results, but it
could suffice. This does require you to know which language the
person writing the query is using.
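
A rough sketch of that stem-then-fold chain, again in plain Java rather
than as a real Lucene analyzer. Everything named here is hypothetical:
StemThenFoldIndex is a toy in-memory index, and toyDutchStem is a crude
suffix stripper that only exists to handle the "werk" example, not a real
Dutch stemmer. The point is just that index time and query time go through
the same normalize step.

```java
import java.text.Normalizer;
import java.util.*;
import java.util.function.UnaryOperator;

// Hypothetical toy index: term -> ids of the documents containing it.
final class StemThenFoldIndex {
    private final UnaryOperator<String> stemmer;
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    StemThenFoldIndex(UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    // One chain for both indexing and querying:
    // lowercase, stem with the language's stemmer, then fold accents.
    String normalize(String token) {
        String stemmed = stemmer.apply(token.toLowerCase(Locale.ROOT));
        return Normalizer.normalize(stemmed, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}+", "");
    }

    void add(int docId, String text) {
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(normalize(token), k -> new TreeSet<>())
                        .add(docId);
            }
        }
    }

    Set<Integer> search(String term) {
        return postings.getOrDefault(normalize(term), Collections.emptySet());
    }

    // Assumption: a stand-in for a real Dutch stemmer. It strips a leading
    // "ge" and a trailing "te" or "t" from words longer than four letters.
    static String toyDutchStem(String word) {
        String w = word;
        if (w.startsWith("ge") && w.length() > 4) w = w.substring(2);
        if (w.endsWith("te") && w.length() > 4) w = w.substring(0, w.length() - 2);
        else if (w.endsWith("t") && w.length() > 4) w = w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        StemThenFoldIndex index = new StemThenFoldIndex(StemThenFoldIndex::toyDutchStem);
        index.add(1, "hij werkte");
        index.add(2, "zij heeft gewerkt");
        System.out.println(index.search("werk")); // [1, 2]
    }
}
```

Swapping in a stemmer for a different language at query time would send
the query through a different chain than the one used for indexing, which
is where the "interesting" results come from.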

But no matter what you do, I'm pretty sure you'll get some results that
won't be what you want. The results may be close enough though.


On 5/9/07, Geoffrey De Smet <> wrote:
> We'd use a different index for each configured locale's language;
> however, this might have an impact on performance.
> Would this be attainable (maybe some day in Lucene)?
> - Use an IndexEverythingAnalyzer for writing,
> so "werk", "werkte", "gewerkt" and "en" are indexed as-is when they are
> encountered.
> - And then use a DutchAnalyzer for reading,
> which if I ask "werk" searches for "werk", "werkte" and "gewerkt",
> and also ignores stop words like "en" in the query.
> EnglishAnalyzer would search with "werk" for "werk", "werkes", "werked",
> ...
> - It might seem a bad idea to mix several languages in the same index,
> but in reality little data comes with metadata that declares the
> language the data is written in.
> With kind regards,
> Geoffrey De Smet
> Chris Hostetter wrote:
> > : There is nothing canned that I know of. I'm also not sure how this
> > : would be used. If you're using a single index, how are you going
> > : to index, then search using these analyzers? Or is there some
> > : other magic going on?
> >
> > i suspect the use case is a "shipped" software product, where you want to
> > have one jar that works anywhere, but you want the analyzer used to depend
> > on the Locale of the JVM the software is installed in.
> >
> > Personally, i would advise against auto-selecting an Analyzer based on the
> > runtime Locale ... it's a fine approach when dealing with purely transient
> > data (ie: parsing Dates input into a form) but it's a bad idea for
> > persistent data (ie: formatting dates to write them to a file) because the
> > user could change their Locale and now the index they built the last time
> > they ran your software doesn't work anymore.
> >
> > just make it an option configurable at install time.
> >
> >
> >
> > -Hoss
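
Hoss's install-time suggestion could look something like the sketch below.
It assumes the chosen language is written once to a properties file that
lives next to the index; the class name IndexLanguageConfig and the
property key index.language are invented for this example. Later runs read
the stored value instead of calling Locale.getDefault(), so a changed JVM
locale can't silently change the analyzer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Properties;

// Hypothetical helper: persists the language an index was built with.
final class IndexLanguageConfig {
    static final String KEY = "index.language"; // made-up property name

    // Call once, at install time, with the language the user picked.
    static void save(Path file, Locale chosen) throws IOException {
        Properties props = new Properties();
        props.setProperty(KEY, chosen.toLanguageTag());
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, "Language the index was built with");
        }
    }

    // Call on every later run; never falls back to the JVM default locale.
    static Locale load(Path file) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        }
        String tag = props.getProperty(KEY);
        if (tag == null) {
            throw new IllegalStateException("missing " + KEY + " in " + file);
        }
        return Locale.forLanguageTag(tag);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("index-language", ".properties");
        save(file, Locale.forLanguageTag("nl"));
        Locale.setDefault(Locale.ENGLISH);            // the user's locale changes
        System.out.println(load(file).getLanguage()); // still nl
    }
}
```

The application would then pick its analyzer from the loaded Locale, for
both indexing and searching.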