lucene-dev mailing list archives

From Tatu Saloranta <t...@hypermall.net>
Subject Re: DO NOT REPLY [Bug 17954] - no hits when doing wildcard queries with words containing german umlauts
Date Fri, 14 Mar 2003 05:29:38 GMT
On Thursday 13 March 2003 10:13, bugzilla@apache.org wrote:
[ok ok, I'll be replying against the warnings]
>
> http://nagoya.apache.org/bugzilla/show_bug.cgi?id=17954
>
> no hits when doing wildcard queries with words containing german umlauts
...
> ------- Additional Comments From otis@apache.org  2003-03-13 17:13 -------
> Oh, I meant the test case that includes the code.
> Since you sent HTML with umlauts, my guess is that something changes the
> tokens with umlauts on their way into the indexer (e.g. HTML parser, your
> analyzer, something else)
>
> I'm tempted to close this bug as INVALID, so please send self-enclosed code
> sample that includes indexing and searching part and demonstrates the
> problem you are describing.

Yes, the cause is very likely the difference between content that gets
indexed through an analyser and a prefix/wildcard query term that does not
get analysed.
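
To illustrate that mismatch, here is a minimal Java sketch. It does not use
Lucene at all: analyzeForIndex is a hypothetical stand-in for whatever
index-time analysis is in effect (assumed here to lowercase and strip
umlauts), and prefixMatches merely mimics a prefix query being compared
against the raw terms in the index.

```java
import java.util.Locale;

public class AnalysisMismatchDemo {

    // Hypothetical index-time analysis (an assumption, not Lucene's code):
    // lowercase the token and replace German umlauts with their base letters.
    public static String analyzeForIndex(String token) {
        return token.toLowerCase(Locale.ROOT)
                .replace("\u00e4", "a")   // a-umlaut
                .replace("\u00f6", "o")   // o-umlaut
                .replace("\u00fc", "u")   // u-umlaut
                .replace("\u00df", "ss"); // sharp s
    }

    // Mimics a prefix query: the prefix is compared as-is against an indexed term.
    public static boolean prefixMatches(String indexedTerm, String prefix) {
        return indexedTerm.startsWith(prefix);
    }

    public static void main(String[] args) {
        // "M\u00fcller" is stored in the index as "muller".
        String indexed = analyzeForIndex("M\u00fcller");

        // Raw, unanalysed prefix as the user typed it: no hit.
        System.out.println(prefixMatches(indexed, "M\u00fcl"));                  // false

        // Same prefix passed through the same analysis first: hit.
        System.out.println(prefixMatches(indexed, analyzeForIndex("M\u00fcl"))); // true
    }
}
```

The only point of the sketch is that the query term has to go through the
same transformation as the indexed tokens, otherwise the two never meet.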

Perhaps QueryParser just needs an (optional) secondary Analyzer (or perhaps
two, actually, as prefix queries are easier to tokenize than full wildcard
queries) that can be set so that these terms get analysed properly. As was
previously discussed, using just the standard analyser is not (and cannot
be) 100% reliable, but some experience suggests that it often works well
enough (using simple heuristics).

If anyone wants to work on this, another very useful piece would then be a
WildcardAnalyzer that would not consider '*' and '?' to be stop chars but,
say, just normal word characters. Combine this with lowercasing and, in the
case of German, umlaut removal, and the problem reported should be solvable?
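
A rough sketch of what such a normalisation step might look like, in plain
Java and independent of Lucene. WildcardTermNormalizer and normalize are
hypothetical names made up for illustration; the sketch assumes a single
query term, treats '*' and '?' as ordinary word characters, lowercases,
strips German umlauts, and simply drops any other non-alphanumeric
character instead of splitting on it.

```java
import java.util.Locale;

public class WildcardTermNormalizer {

    // Hypothetical helper (not a Lucene class): normalises one wildcard term.
    public static String normalize(String wildcardTerm) {
        String lower = wildcardTerm.toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            switch (c) {
                case '*':
                case '?':
                    out.append(c);      // wildcards are kept as word characters
                    break;
                case '\u00e4':          // a-umlaut
                    out.append('a');
                    break;
                case '\u00f6':          // o-umlaut
                    out.append('o');
                    break;
                case '\u00fc':          // u-umlaut
                    out.append('u');
                    break;
                case '\u00df':          // sharp s
                    out.append("ss");
                    break;
                default:
                    // Assumption: anything that is not a letter or digit is dropped.
                    if (Character.isLetterOrDigit(c)) {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("M\u00fcll*")); // mull*
    }
}
```

Whether umlauts should be stripped or expanded (e.g. to "ue") depends on what
the index-time analyser does; the two sides just have to agree.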

-+ Tatu +-

