lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doron Cohen <DOR...@il.ibm.com>
Subject Re: Multi language indexing
Date Tue, 08 May 2007 00:51:37 GMT
bhecht <bhecht@ams-sys.com> wrote on 07/05/2007 10:26:27:

> I have implemented my own analyzer for each country.
> So as I see it, when I index these records, I want to
> provide lucene, with a specific analyzer per record
> i'm indexing.
>
> When a user performs a query in my JSF form, I will
> use the country value he entered, to get the needed
> analyzer, and query lucene with the users query and
> the needed analyzer.
>
> The user may also choose not to enter a country value
> to his search, and here comes in the solution you gave
> me, to duplicate each field, and index it using a non
> stemming analyzer (A standard analyzer without stop
> words defined).
>
> Am I going the right direction?

Sounds ok to me except that there seems to be a mix
between stemming and stop-words elimination. Perhaps
just a typo in the above text, but anyhow while the
StandardAnalyzer constructor takes a stopwords list
parameter and would eliminate these words (e.g. "is"),
it would not do stemming (e.g "knives" --> "knive").
(Though both a stop-list and a stemming algorithm
are language specific.)

So, rephrasing the discussion so far, assuming:

1) a single field "F" (for simplicity),
2) (doc) language always known at indexing
3) (user) language sometimes known at search

I think a resonable solution might be:

1) use PerFieldanalyzerWrapper
2) index each doc to F and to F_LAN
3) F would be language neutral - no
   stemming and no stop words elimination
4) F_LAN (e.g. F_en) would be language specific,
   so a specific language stopwords list would be
   used, and a specific stemmer would be used.
5) Search would go to F_LAN when the language is
   known and to F when the language is not known,
   using language specific analysis as while indexing.
6) Note Karl's mentioning having both F and F_LAN at
   search, assigning higher boost to F_LAN. Useful when
   there is some uncertainty on the "marked language".

There can be other considerations - for instance (1) the
certainty of language id; (2) fallback to English when the
language is unknown...

Note that SnowballFilter can be used for applying
stemming on the output of StandardAnalyzer.

Doron



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message