lucene-dev mailing list archives

From maurits van wijland <m.vanwijl...@quicknet.nl>
Subject Re: [PATCH] Refactoring QueryParser.jj, setLowercaseWildcardTerms()
Date Thu, 13 Feb 2003 06:57:42 GMT
Hi all,

Maybe we should start using stemming in a different manner. Look at it
from the perspective of query expansion. If we store stems in a separate
table, we will not have this problem!

So, each token is stored in the index as a term.
Each term is stemmed with the appropriate stemmer.
Each stem and its unstemmed term are stored in a separate index.

We could then search using the terms entered: first find all the terms
that match the WildcardQuery. Next, you could take the terms found and
stem them.
From there, you retrieve all the terms related to each stem!
Finally, search for documents with all terms retrieved.

This would give an extra option for end users, turning query expansion on or
off.
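
The steps above can be sketched in plain Java, without touching Lucene's
API. Everything here is an assumption made for illustration: the class name
StemExpansionSketch, the in-memory maps standing in for the separate stem
index, and especially toyStem(), which is only a crude placeholder for a real
language-specific stemmer. The expansionOn flag models the on/off option.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical sketch: an in-memory stand-in for "terms plus a separate stem index".
class StemExpansionSketch {
    // All indexed (lowercased) terms.
    private final SortedSet<String> terms = new TreeSet<>();
    // The separate index: stem -> every term that reduces to that stem.
    private final Map<String, SortedSet<String>> termsByStem = new HashMap<>();

    // Placeholder stemmer (assumption): folds umlauts (\u00e4, \u00f6, \u00fc)
    // and strips at most one common suffix, keeping at least four characters.
    static String toyStem(String term) {
        String t = term.toLowerCase(Locale.ROOT)
                .replace('\u00e4', 'a').replace('\u00f6', 'o').replace('\u00fc', 'u');
        for (String suffix : Arrays.asList("ern", "er", "en", "e", "s")) {
            if (t.endsWith(suffix) && t.length() - suffix.length() >= 4) {
                return t.substring(0, t.length() - suffix.length());
            }
        }
        return t;
    }

    // Store the unstemmed term, and record it under its stem.
    void addTerm(String rawTerm) {
        String term = rawTerm.toLowerCase(Locale.ROOT);
        terms.add(term);
        termsByStem.computeIfAbsent(toyStem(term), k -> new TreeSet<>()).add(term);
    }

    // Translate '*' and '?' wildcards into a regular expression.
    static Pattern wildcardToPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // 1. find the terms matching the wildcard, 2. stem them,
    // 3. collect every term sharing one of those stems.
    SortedSet<String> expand(String wildcard, boolean expansionOn) {
        Pattern pattern = wildcardToPattern(wildcard);
        SortedSet<String> matched = new TreeSet<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) {
                matched.add(term);
            }
        }
        if (!expansionOn) {
            return matched;
        }
        SortedSet<String> expanded = new TreeSet<>(matched);
        for (String term : matched) {
            expanded.addAll(termsByStem.get(toyStem(term)));
        }
        return expanded;
    }
}
```

The returned set is what would then be OR-ed together in the final document
search. With expansion off, the wildcard only matches the literal terms; with
it on, the matches pull in their stem siblings as well.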

Your thoughts, please.

kind regards,

Maurits.

----- Original Message -----
From: "Tatu Saloranta" <tatu@hypermall.net>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Sent: Thursday, February 13, 2003 2:43 AM
Subject: Re: [PATCH] Refactoring QueryParser.jj, setLowercaseWildcardTerms()


> On Wednesday 12 February 2003 11:39, Christoph Kiehl wrote:
> > Hi Doug,
> >
> > > Also, I think we should lowercase prefix and wildcard queries by
> ...
> > > wildcard searches. What do others think?
> >
> > For the StandardAnalyzer this might work, but for the GermanAnalyzer, there
>
> Solving this problem should be easier after refactoring: just override
> 'getPrefixQuery()' and 'getWildcardQuery()' (see below for one possible
> idea of what could be done).
>
> Another possibility would be a property that enables using the same
> analyzer for wildcard/prefix queries as for normal terms.
>
> However, using typical analyzers is not something one usually wants to do,
> for a couple of reasons:
>
> - Wildcards are discarded by the analyzer, so the wildcard query gets broken
>   (i.e. one needs a wildcard-character-aware analyzer).
> - Stemming can only be done for prefix queries (what is the stem of,
>   say, "hä*er"?), and even then it might not produce the stem one would
>   want. For example, the prefix query "men*" might be 'stemmed' to
>   "man*", and the user might be perplexed at why documents with
>   words like "meningitis" and "menstrual" did not match (ok, that is
>   a contrived example, but I hope you get the idea).
>   In a way, you could think of the user as doing "manual stemming", using
>   a stem of a word with a prefix query.
>
> In the case of German, if umlaut characters are typically converted, perhaps
> you could create a GermanQueryParser.java that just extends the default
> query parser and does the necessary transformation for wildcard/prefix
> queries? Since separate language-dependent stemmers already exist, this
> might make sense?
>
> > is also the problem with Umlauts (ä,ö,ü) turned into vowels (a,o,u) while
> > indexing. An example: "Häuser" is the plural of "Haus". If I index "Häuser"
> > it is stemmed to "hau". If I do for example a search for "häus*" nothing is
>
> Not "haus"?
>
> > found, because "häus" is not stemmed. If I analyzed "häus*" I should
> > get "hau*". The problem is that now you get not only "Häuser" but also
> > "Haus" as a result. But I think it is better to get more results than no
> > result. This is perhaps a special problem with the GermanAnalyzer. Maybe
> > there could be an option to use the Analyzer also for wildcard queries,
> > so I can turn it on in my case while it defaults to off.
> > Hope you understand my problem ;)
>
> Yes I do... I don't even dare to think of the problems a Finnish analyzer
> might have with stemming. :-)
>
> -+ Tatu +-
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>



