lucene-dev mailing list archives

From Tatu Saloranta <t...@hypermall.net>
Subject Re: [PATCH] Refactoring QueryParser.jj, setLowercaseWildcardTerms()
Date Thu, 13 Feb 2003 01:43:43 GMT
On Wednesday 12 February 2003 11:39, Christoph Kiehl wrote:
> Hi Doug,
>
> > Also, I think we should lowercase prefix and wildcard queries by
...
> > wildcard searches. What do others think?
>
> For the StandardAnalyzer this might work, but for the GermanAnalyzer, there

Solving this problem should be easier after the refactoring: just
override 'getPrefixQuery()' and 'getWildcardQuery()' (see below for one
possible idea of what could be done).

Another possibility would be to add a property that enables using the same
analyzer for wildcard/prefix queries as is used for normal terms.

However, running wildcard/prefix terms through a typical analyzer is not
something one usually wants to do, for a couple of reasons:

- Wildcards are discarded by the analyzer, so the wildcard query gets broken
  (i.e. one needs an analyzer that is aware of wildcard characters).
- Stemming can only be done for prefix queries (what is the stem of,
  say, "hä*er"?), and even then it might not produce the stem one would
  want. For example, the prefix query "men*" might be 'stemmed' to
  "man*", and the user might be perplexed as to why documents with
  words like "meningitis" and "menstrual" did not match (OK, that is
  a contrived example, but I hope you get the idea).
  In a way, you could think of the user as doing "manual stemming", using
  the stem of a word with a prefix query.
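
To make the first point concrete, here is a minimal sketch of what a
"wildcard-aware" normalization step could look like. It is not Lucene code:
the class WildcardAwareNormalizer and its normalize() method are hypothetical
names made up for illustration. The assumption is that '*' and '?' are the
only wildcard characters, and that the caller supplies whatever per-fragment
transformation (lowercasing, character folding, ...) it wants applied.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Lucene: normalizes only the literal
// runs of a wildcard term and copies '*' and '?' through untouched.
final class WildcardAwareNormalizer {
    private WildcardAwareNormalizer() {}

    static String normalize(String term, UnaryOperator<String> normalizer) {
        StringBuilder out = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '*' || c == '?') {
                // Flush the literal run collected so far, then keep the wildcard.
                if (literal.length() > 0) {
                    out.append(normalizer.apply(literal.toString()));
                    literal.setLength(0);
                }
                out.append(c);
            } else {
                literal.append(c);
            }
        }
        // Flush any trailing literal run.
        if (literal.length() > 0) {
            out.append(normalizer.apply(literal.toString()));
        }
        return out.toString();
    }

    // Convenience wrapper: lowercase the literal parts only.
    static String lowercase(String term) {
        return normalize(term, s -> s.toLowerCase(Locale.ROOT));
    }
}
```

Note that this only sidesteps the "wildcards get discarded" problem. Passing
a stemmer as the normalizer would still run into the second point, since the
fragments between wildcards are not whole words.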

In the case of German, if umlaut characters are typically converted, perhaps
you could create a GermanQueryParser.java that just extends the default query
parser and does the necessary transformation for wildcard/prefix queries?
Since separate language-dependent stemmers already exist, this might make
sense?
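
A rough sketch of that subclass idea follows. To keep it self-contained it
does not use Lucene at all: SketchQueryParser is a hypothetical stand-in for
the refactored parser, its two hook methods only borrow the names
getPrefixQuery/getWildcardQuery, and their signatures and String return type
are assumptions, not the real API. The folding covers only lowercase
ä, ö, ü and does no stemming.

```java
import java.util.Locale;

// Hypothetical stand-in for the refactored parser (NOT Lucene's class).
// Queries are represented as plain strings to keep the sketch runnable.
class SketchQueryParser {
    public String getPrefixQuery(String field, String prefix) {
        return field + ":" + prefix + "*";
    }

    public String getWildcardQuery(String field, String pattern) {
        return field + ":" + pattern;
    }
}

// Language-specific subclass: transforms the term, then delegates.
class GermanSketchQueryParser extends SketchQueryParser {

    // Lowercases and folds umlauts to plain vowels; every other
    // character, including '*' and '?', is copied through unchanged.
    static String fold(String s) {
        String lower = s.toLowerCase(Locale.GERMAN);
        StringBuilder sb = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            switch (c) {
                case '\u00e4': sb.append('a'); break; // a-umlaut
                case '\u00f6': sb.append('o'); break; // o-umlaut
                case '\u00fc': sb.append('u'); break; // u-umlaut
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    @Override
    public String getPrefixQuery(String field, String prefix) {
        return super.getPrefixQuery(field, fold(prefix));
    }

    @Override
    public String getWildcardQuery(String field, String pattern) {
        return super.getWildcardQuery(field, fold(pattern));
    }
}
```

With this, a prefix term like "Häus" would be turned into "haus" before the
prefix query is built. Whether that then matches anything still depends on
what the stemmer put into the index.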

> is also the problem with Umlauts (ä,ö,ü) turned into vowels (a,o,u) while
> indexing. An example: "Häuser" is the plural of "Haus". If I index "Häuser"
> it is stemmed to "hau". If I do for example a search for "häus*" nothing is

Not "haus"?

> found, because "häus" is not stemmed. If I would analyze "häus*" I should
> get "hau*". The problem is, that now you do not only get "Häuser" but also
> "Haus" as result. But I think it is better to get more results than no
> result. This is perhaps a special problem with the GermanAnalyzer. May be
> there could be an option to use the Analyzer also for wildcard queries. So
> I can turn it on in my case and defaults to off.
> Hope you understand my problem ;)

Yes I do... I don't even dare to think of the problems a Finnish analyzer
might have with stemming. :-)

-+ Tatu +-


