lucene-java-user mailing list archives

From "Yonik Seeley" <yo...@apache.org>
Subject Re: Mixing Case and Case-Insensitive Searching
Date Fri, 11 May 2007 21:19:00 GMT
On 5/11/07, Walt Stoneburner <walt.stoneburner@gmail.com> wrote:
> In this tutorial he stresses not once, not twice, but three times that
> the same Analyzer that is used to build an index -must- also be used
> when performing a Query.  There is great detail explaining why this is
> so.
>
> However, in order to get our magic to work, we need to violate this
> rule in a very clever way.

Yeah, "compatible" analyzer would be a better way to put it.  Using
the same analyzer for anything that produces multiple tokens at the
same position is normally wrong.
Solr allows specification of a "query" analyzer and an "index"
analyzer for these cases.
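
In plain Lucene the same split just means handing different analyzers to
IndexWriter and QueryParser.  A minimal sketch; the "contents" field name is
only an example, and the two analyzers passed in would be whatever you build
for the dual-case scheme (e.g. from the filters sketched further down):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;

public class IndexQuerySplit {
    // index-time analyzer: emits both exact-case and lowercased tokens
    public static void index(Directory dir, Analyzer indexAnalyzer) throws Exception {
        IndexWriter writer = new IndexWriter(dir, indexAnalyzer, true);
        // ... writer.addDocument(...) for each document ...
        writer.close();
    }

    // query-time analyzer: interprets the case marker instead of stacking tokens
    public static Query parse(String userQuery, Analyzer queryAnalyzer) throws Exception {
        QueryParser parser = new QueryParser("contents", queryAnalyzer);
        return parser.parse(userQuery);   // e.g. "$Apache" = exact case, "apache" = any case
    }
}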

> STEP ONE: Building an index that has both case-sensitive and
> case-insensitive tokens in it.

Yep, your approach sounds fine, and will work in phrase queries (which
the two-field solution currently can't handle).  The greater
difficulty lies in making it generic (working for many analyzers,
etc).
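
For the record, the index-time half could be as small as a TokenFilter that
stacks a lowercased copy on top of each original token.  A rough, untested
sketch against the 2.x Token/TokenFilter API:

import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class DualCaseFilter extends TokenFilter {
    private final LinkedList<Token> pending = new LinkedList<Token>();

    public DualCaseFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            return pending.removeFirst();      // emit the queued lowercased variant
        }
        Token t = input.next();
        if (t == null) {
            return null;
        }
        String lower = t.termText().toLowerCase();
        if (!lower.equals(t.termText())) {
            Token folded = new Token(lower, t.startOffset(), t.endOffset(), t.type());
            folded.setPositionIncrement(0);    // same position as the original token
            pending.add(folded);
        }
        return t;                              // original-case token goes out first
    }
}

Because both forms sit at the same position, phrase queries over either the
exact-case or the folded terms line up the way you'd expect.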

> This step is where things get complicated.  It turns out that
> StandardAnalyzer, which uses the StandardTokenizer, throws away dollar
> signs.  So, it doesn't matter how many you type in your query, they
> all vanish, never giving you the opportunity to do anything with them
> downstream.

This points out the difficulty of doing this in a *generic* way.
Better than a "$" would be a flag on the Token, IMO.  That isn't really
supported by Lucene at the moment, but you could perhaps subclass Token.
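Something along these lines, purely illustrative; your own tokenizer and
downstream filters would have to know to look for the flag:

import org.apache.lucene.analysis.Token;

// Carries "this term must match exact case" as state on the Token itself,
// rather than as a "$" character that StandardTokenizer would strip.
public class CaseSensitiveToken extends Token {
    private final boolean exactCase;

    public CaseSensitiveToken(String text, int start, int end, boolean exactCase) {
        super(text, start, end);
        this.exactCase = exactCase;
    }

    public boolean isExactCase() {
        return exactCase;
    }
}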


> Bringing it all together, it's now possible to use your new query-time
> token analyzer with the QueryParser.  Calling .parse() with
> dollar-sign-prefixed strings will search for exact-case matches,
> while omitting the dollar sign works like the regular old Lucene we all
> know and love.
>
> The down side...?  The index has twice as many tokens.

I've also considered case-insensitive support at the TermEnum level.
It would make lookups slower, but the index wouldn't be much bigger (only
slightly, because everything would be indexed without lowercasing).
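Roughly, at search time it would look something like this (untested sketch;
a real version would want smarter seeking than a full scan of the field's
terms, and a large index could blow past BooleanQuery's clause limit):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CaseInsensitiveTermLookup {
    // The index stores terms with original case only; we walk the TermEnum
    // for the field and collect every term whose lowercased form matches.
    public static Query build(IndexReader reader, String field, String text)
            throws IOException {
        String target = text.toLowerCase();
        BooleanQuery query = new BooleanQuery();
        TermEnum terms = reader.terms(new Term(field, ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field)) {
                    break;                         // ran off the end of this field
                }
                if (t.text().toLowerCase().equals(target)) {
                    query.add(new TermQuery(t), BooleanClause.Occur.SHOULD);
                }
            } while (terms.next());
        } finally {
            terms.close();
        }
        return query;
    }
}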

> I'd love to see a formal syntax like this officially enter the Lucene
> standard query language someday.
>
> If someone can point me at how to do this without twiddling
> Lucene's code directly, I'd be happy to contribute the modification.

If you picked a token prefix/postfix that would pass through
the QueryParser w/o a syntax error, the necessary manipulation could
all be done in the Analyzer/TokenFilter.  Much easier, but perhaps not
as nice a syntax.
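
For example, with a tokenizer that doesn't eat the marker (whitespace
tokenization, say), the query-side filter could strip it and decide whether
to lowercase.  Another rough, untested sketch, where "$" is just Walt's
convention and any surviving prefix would do:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class CaseMarkerQueryFilter extends TokenFilter {
    public static final String MARKER = "$";

    public CaseMarkerQueryFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) {
            return null;
        }
        String text = t.termText();
        boolean exact = text.startsWith(MARKER);
        // marked terms keep their case (matching the exact-case index tokens);
        // unmarked terms are lowercased (matching the case-folded index tokens)
        String newText = exact ? text.substring(MARKER.length()) : text.toLowerCase();
        Token out = new Token(newText, t.startOffset(), t.endOffset(), t.type());
        out.setPositionIncrement(t.getPositionIncrement());
        return out;
    }
}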

-Yonik
