lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Automatic analyzer resolving based on Locale
Date Wed, 09 May 2007 18:01:15 GMT

: - Use an IndexEverythingAnalyzer for writing,
: so "werk", "werkte", "gewerkt" and "en" is indexed as-is when they are
: encountered.
:
: - And then use a DutchAnalyzer for reading,
: which if I ask "werk" searches for "werk", "werkte" and "gewerkt",
: and also ignores stop words like "en" in the query.
: EnglishAnalyzer would search with "werk" for "werk", "werkes", "werked", ...

that sounds like a fairly simplistic view of things ... successful analysis
tends to require compatible efforts taken when indexing and querying ...
when indexing you know context and structure; when querying you have much
more limited information to start with.  Most stemming algorithms i've
seen tend to rely more on work done when indexing than work done when
searching -- even if you had an extremely well maintained dictionary based
stemmer, you would probably wind up wanting to do stem normalization when
indexing (as opposed to expansion when searching) to keep your search
times fast.
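
A minimal sketch of the "normalize when indexing" idea, in plain Java with
no Lucene dependency.  Everything here is hypothetical and for illustration
only: the class name, the toy stem() method (which only handles the example
words from this thread and is not a real Dutch stemmer), and the one-word
stop list.  The point is that the same analysis runs on both sides, so a
query for "werk" is a single term lookup rather than an expansion into
every inflected form.

```java
import java.util.*;

public class StemNormalizationSketch {
    // Assumption: a one-word stop list, just enough for the thread's example.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("en"));

    // Toy suffix stripper (hypothetical); a real stemmer is far more involved.
    static String stem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (t.startsWith("ge") && t.endsWith("t") && t.length() > 4) {
            return t.substring(2, t.length() - 1); // gewerkt -> werk
        }
        if (t.endsWith("te") && t.length() > 4) {
            return t.substring(0, t.length() - 2); // werkte -> werk
        }
        return t;
    }

    // The same analysis chain is used for documents and for queries.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw.toLowerCase(Locale.ROOT))) {
                continue; // drop empty tokens and stop words
            }
            terms.add(stem(raw));
        }
        return terms;
    }

    // Index time: store only the normalized term for each token.
    static Map<String, Set<Integer>> index(List<String> docs) {
        Map<String, Set<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : analyze(docs.get(id))) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    // Query time: one lookup per normalized query term, no expansion.
    static Set<Integer> search(Map<String, Set<Integer>> postings, String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : analyze(query)) {
            hits.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "ik werk", "hij werkte en zij", "wij hebben gewerkt");
        Map<String, Set<Integer>> postings = index(docs);
        System.out.println(search(postings, "werk"));
    }
}
```

With the "index as-is, expand at query time" approach, the search side
would instead have to generate and look up every variant of each query
term, which is where the extra search cost comes from.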

: - It might seem a bad idea to mix several languages in the same index,
: but in reality few data comes with the meta-data which declares the
: language of the data is written in.

i'm told by the people who seem to know a lot about this that this is
where language detection comes into play ... when indexing documents, even
without metadata, you tend to have a large enough "chunk" of text to make
assumptions about the language.
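
To make that concrete, here is a deliberately crude sketch in plain Java.
It is an assumption-laden stand-in, not how real language detectors work
(those typically use character n-gram statistics): the class name, the tiny
stop word lists, and the "unknown" fallback are all hypothetical.  It only
shows why a document-sized chunk of text gives you something to go on while
a one-word query does not.

```java
import java.util.*;

public class LanguageGuessSketch {
    // Assumption: tiny hand-picked stop word lists, keyed by language code.
    static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("nl", new HashSet<>(
                Arrays.asList("en", "de", "het", "een", "van", "is")));
        STOP_WORDS.put("en", new HashSet<>(
                Arrays.asList("and", "the", "a", "of", "is", "to")));
    }

    // Returns the language whose stop words occur most often in the text,
    // or "unknown" when no stop word from any list is seen.
    static String guessLanguage(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> entry : STOP_WORDS.entrySet()) {
            int score = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    score++;
                }
            }
            if (score > bestScore) { // ties keep the earlier language
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guessLanguage("de hond en de kat van het huis"));
        System.out.println(guessLanguage("the cat and the dog of the house"));
        System.out.println(guessLanguage("werk")); // too little text to decide
    }
}
```

The guessed language could then be used to pick a per-document analyzer at
indexing time; a bare query like "werk" gives the guesser nothing to work
with, which is the asymmetry between indexing and querying mentioned above.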




-Hoss

