Message-ID: <c68e39170705111419w1e3ccf1bs8af6ef52c3b6b0ab@mail.gmail.com>
Date: Fri, 11 May 2007 17:19:00 -0400
From: "Yonik Seeley" <yonik@apache.org>
Sender: yseeley@gmail.com
To: java-user@lucene.apache.org
Subject: Re: Mixing Case and Case-Insensitive Searching
In-Reply-To: <ae6f38f60705111349o6e042ce6r30b89b52b0277f07@mail.gmail.com>
References: <ae6f38f60705111349o6e042ce6r30b89b52b0277f07@mail.gmail.com>

On 5/11/07, Walt Stoneburner <walt.stoneburner@gmail.com> wrote:
> In this tutorial he stresses not once, not twice, but three times that
> the same Analyzer that is used to build an index -must- also be used
> when performing a Query.  There is great detail explaining why this is
> so.
>
> However, in order to get our magic to work, we need to violate this
> rule in a very clever way.

Yeah, "compatible" analyzer would be a better way to put it.  Reusing
the index-time analyzer at query time is normally wrong for anything
that produces multiple tokens at the same position.
Solr lets you specify separate "index" and "query" analyzers for
these cases.

> STEP ONE: Building an index that has both case-sensitive and
> case-insensitive tokens in it.

Yep, your approach sounds fine, and will work in phrase queries (which
the two-field solution currently can't handle).  The greater
difficulty lies in making it generic (working for many analyzers,
etc).
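
A minimal sketch of the idea, in plain Java rather than Lucene's
Analyzer/TokenStream classes: the index side stacks two tokens at each
position, while the query side emits exactly one per position, which
is why phrase matching still works.  The class and method names, the
whitespace tokenization, and the "$" marker on the exact-case token
are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class StackedCaseTokens {
    // Assumed marker that tags the exact-case variant of a term.
    static final String EXACT = "$";

    /** Index-time analysis: two tokens per position (lowercased + marked original). */
    static List<List<String>> analyzeForIndex(String text) {
        List<List<String>> positions = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            positions.add(List.of(word.toLowerCase(Locale.ROOT), EXACT + word));
        }
        return positions;
    }

    /** Query-time analysis: exactly one token per position. */
    static List<String> analyzeForQuery(String text, boolean exactCase) {
        List<String> terms = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            terms.add(exactCase ? EXACT + word : word.toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    /** Phrase match: query terms must occur at consecutive positions. */
    static boolean phraseMatches(List<List<String>> doc, List<String> phrase) {
        if (phrase.isEmpty()) return false;
        for (int start = 0; start + phrase.size() <= doc.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < phrase.size() && ok; i++) {
                ok = doc.get(start + i).contains(phrase.get(i));
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<List<String>> doc = analyzeForIndex("The Quick brown Fox");
        System.out.println(phraseMatches(doc, analyzeForQuery("quick BROWN", false))); // true
        System.out.println(phraseMatches(doc, analyzeForQuery("Quick brown", true)));  // true
        System.out.println(phraseMatches(doc, analyzeForQuery("quick brown", true)));  // false
    }
}
```

Running the index-side method over the query text instead would give
two terms per query position, which is the "same analyzer" trap.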

> This step is where things get complicated.  It turns out that
> StandardAnalyzer, which uses the StandardTokenizer, throws away dollar
> signs.  So, it doesn't matter how many you type in your query, they
> all vanish, never giving you the opportunity to do anything with them
> downstream.

This points out the difficulty of doing this in a *generic* way.
Better than a "$" would be a flag on the Token IMO.  Lucene doesn't
really support that currently, but you could perhaps subclass Token.
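
To illustrate why a flag is more robust than a marker character, here
is a rough sketch in plain Java.  It does not subclass Lucene's Token;
FlaggedToken and both filter methods are hypothetical stand-ins.  The
point is only that a filter which rewrites the term text cannot
destroy information carried outside the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class FlaggedTokenSketch {
    /** Hypothetical token type: term text plus an out-of-band exact-case flag. */
    static final class FlaggedToken {
        final String text;
        final boolean exactCase;

        FlaggedToken(String text, boolean exactCase) {
            this.text = text;
            this.exactCase = exactCase;
        }

        @Override
        public String toString() {
            return text + (exactCase ? "[exact]" : "");
        }
    }

    /** Emits a lowercased token and a flagged original-case token for each word. */
    static List<FlaggedToken> duplicateWithFlag(List<String> words) {
        List<FlaggedToken> out = new ArrayList<>();
        for (String w : words) {
            out.add(new FlaggedToken(w.toLowerCase(Locale.ROOT), false));
            out.add(new FlaggedToken(w, true));
        }
        return out;
    }

    /** Stand-in for a later stage that drops every non-alphanumeric character. */
    static List<FlaggedToken> stripPunctuation(List<FlaggedToken> in) {
        List<FlaggedToken> out = new ArrayList<>();
        for (FlaggedToken t : in) {
            String cleaned = t.text.replaceAll("[^\\p{L}\\p{N}]", "");
            if (!cleaned.isEmpty()) {
                out.add(new FlaggedToken(cleaned, t.exactCase)); // flag is carried over
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<FlaggedToken> tokens = stripPunctuation(duplicateWithFlag(List.of("Lucene", "rocks!")));
        System.out.println(tokens); // [lucene, Lucene[exact], rocks, rocks[exact]]
    }
}
```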


> Bringing it all together, it's now possible to use your new query
> version token analyzer with the QueryParser.  And calling .parse()
> with dollar sign prefixed strings will search for exact-case matches,
> where omitting it works like the regular old Lucene we all know and
> love.
>
> The down side...?  The index has twice as many tokens.

I've also considered case-insensitive support at the TermEnum level.
It would make lookups slower, but the index wouldn't be much bigger
(only slightly, because everything would be indexed without
lowercasing, so case variants become separate terms).
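
As a toy model of that trade-off (not Lucene's actual term dictionary
or TermEnum API), the sketch below keeps terms in their original case
in a sorted map.  An exact lookup is a single seek, while the
case-insensitive lookup has to walk the terms and compare each one,
which is where the extra cost comes from.  The class name, the
in-memory map, and the naive full scan are all assumptions; a real
implementation would presumably seek more selectively.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class CaseInsensitiveTermLookup {
    // term (original case preserved) -> ids of documents containing it
    private final TreeMap<String, SortedSet<Integer>> postings = new TreeMap<>();

    void add(String term, int docId) {
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
    }

    /** Exact-case lookup: one seek into the sorted term dictionary. */
    SortedSet<Integer> exact(String term) {
        return new TreeSet<>(postings.getOrDefault(term, new TreeSet<>()));
    }

    /** Case-insensitive lookup: enumerate terms and merge every case variant. */
    SortedSet<Integer> ignoreCase(String term) {
        SortedSet<Integer> hits = new TreeSet<>();
        for (Map.Entry<String, SortedSet<Integer>> e : postings.entrySet()) {
            if (e.getKey().equalsIgnoreCase(term)) {
                hits.addAll(e.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        CaseInsensitiveTermLookup index = new CaseInsensitiveTermLookup();
        index.add("Apple", 1);
        index.add("apple", 2);
        index.add("APPLE", 3);
        index.add("pear", 4);
        System.out.println(index.exact("Apple"));      // [1]
        System.out.println(index.ignoreCase("apple")); // [1, 2, 3]
    }
}
```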

> I'd love to see a formal syntax like this officially enter the Lucene
> standard query language someday.
>
> If someone can point me at how to do this without twiddling
> Lucene's code directly, I'd be happy to contribute the modification.

If you picked a token prefix/postfix that would pass through
the QueryParser w/o a syntax error, the necessary manipulation could
all be done in the Analyzer/TokenFilter.  Much easier, but perhaps not
as nice a syntax.
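
Something along these lines, sketched in plain Java instead of a real
TokenFilter.  The "xcase_" prefix is a made-up marker; whether a given
marker actually survives both the QueryParser and the tokenizer in use
is an assumption that would need checking.  The index side is assumed
to store the exact-case variant under the same marker.

```java
import java.util.List;
import java.util.Locale;

class QueryMarkerFilterSketch {
    // Hypothetical marker, assumed to pass through the parser and tokenizer untouched.
    static final String MARKER = "xcase_";

    /** Index side: a lowercased term plus a marked, original-case term. */
    static List<String> indexTerms(String word) {
        return List.of(word.toLowerCase(Locale.ROOT), MARKER + word);
    }

    /** Query side: marked tokens keep their case, everything else is lowercased. */
    static String toIndexTerm(String queryToken) {
        boolean marked = queryToken.startsWith(MARKER)
                && queryToken.length() > MARKER.length();
        return marked ? queryToken : queryToken.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<String> indexed = indexTerms("Lucene");
        System.out.println(indexed.contains(toIndexTerm("LUCENE")));       // true
        System.out.println(indexed.contains(toIndexTerm("xcase_Lucene"))); // true
        System.out.println(indexed.contains(toIndexTerm("xcase_lucene"))); // false
    }
}
```

One weakness of any in-band marker: a real word that happens to start
with the marker is indistinguishable from a marked token.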

-Yonik
