Date: Fri, 11 May 2007 17:19:00 -0400
From: "Yonik Seeley"
To: java-user@lucene.apache.org
Subject: Re: Mixing Case and Case-Insensitive Searching

On 5/11/07, Walt Stoneburner wrote:
> In this tutorial he stresses not once, not twice, but three times that
> the same Analyzer that is used to build an index -must- also be used
> when performing a Query. There is great detail explaining why this is
> so.
>
> However, in order to get our magic to work, we need to violate this
> rule in a very clever way.

Yeah, "compatible" analyzer would be a better way to put it.
Using the same analyzer for anything that produces multiple tokens at
the same position is normally wrong. Solr allows specification of a
"query" analyzer and an "index" analyzer for these cases.

> STEP ONE: Building an index that has both case-sensitive and
> case-insensitive tokens in it.

Yep, your approach sounds fine, and will work in phrase queries (which
the two-field solution currently can't handle). The greater difficulty
lies in making it generic (working for many analyzers, etc).

> This step is where things get complicated.
> It turns out that StandardAnalyzer, which uses the StandardTokenizer,
> throws away dollar signs. So, it doesn't matter how many you type in
> your query, they all vanish, never giving you the opportunity to do
> anything with them downstream.

This points out the difficulty of doing this in a *generic* way.
Better than a "$" would be a flag on the Token IMO. That isn't really
supported by Lucene currently, but you could perhaps subclass Token.

> Bringing it all together, it's now possible to use your new query
> version token analyzer with the QueryParser. And calling .parse()
> with dollar sign prefixed strings will search for exact-case matches,
> where omitting it works like the regular old Lucene we all know and
> love.
>
> The down side...? The index has twice as many tokens.

I've also considered case-insensitive support at the Term-Enum level.
It would make lookups slower, but the index wouldn't be much bigger
(it would be slightly bigger because one would index everything w/o
lowercasing).

> I'd love to see a formal syntax like this officially enter the Lucene
> standard query language someday.
>
> If someone can point me at how to do this without twiddling
> Lucene's code directly, I'd be happy to contribute the modification.

If you picked a token prefix/postfix that would pass through the
QueryParser w/o a syntax error, the necessary manipulation could all be
done in the Analyzer/TokenFilter. Much easier, but perhaps not as nice
a syntax.

-Yonik
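
A rough sketch of the prefix approach discussed in this message, in
plain Java rather than against Lucene's API. The index side emits a
lowercased token plus a marked exact-case token at the same position;
the query side emits one token per word and keeps marked words as
typed. The marker "_x_", the class DualCaseSketch, and all method names
are hypothetical, and whitespace splitting stands in for a real
tokenizer. Whether a given analyzer or the QueryParser preserves such a
marker would still need to be checked.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: these are not Lucene classes.
final class DualCaseSketch {
    // Hypothetical marker for exact-case tokens (an assumption, not a Lucene convention).
    static final String EXACT = "_x_";

    /** A token plus its position increment (0 = same position as the previous token). */
    static final class Tok {
        final String text;
        final int posInc;
        Tok(String text, int posInc) { this.text = text; this.posInc = posInc; }
    }

    /** Index side: lowercased word, then the marked original at the same position. */
    static List<Tok> indexTokens(String text) {
        List<Tok> out = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            if (w.isEmpty()) continue;
            out.add(new Tok(w.toLowerCase(Locale.ROOT), 1));
            out.add(new Tok(EXACT + w, 0)); // twice as many tokens in the index
        }
        return out;
    }

    /** Query side: one token per word; marked words stay exact, others are lowercased. */
    static List<Tok> queryTokens(String query) {
        List<Tok> out = new ArrayList<>();
        for (String w : query.trim().split("\\s+")) {
            if (w.isEmpty()) continue;
            out.add(new Tok(w.startsWith(EXACT) ? w : w.toLowerCase(Locale.ROOT), 1));
        }
        return out;
    }

    /** Phrase match: the query tokens must occur at consecutive index positions. */
    static boolean phraseMatches(List<Tok> indexed, List<Tok> query) {
        if (query.isEmpty()) return false;
        // Group indexed tokens by position using their increments.
        List<Set<String>> positions = new ArrayList<>();
        for (Tok t : indexed) {
            if (t.posInc > 0 || positions.isEmpty()) positions.add(new HashSet<>());
            positions.get(positions.size() - 1).add(t.text);
        }
        for (int start = 0; start + query.size() <= positions.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < query.size() && ok; i++) {
                ok = positions.get(start + i).contains(query.get(i).text);
            }
            if (ok) return true;
        }
        return false;
    }
}
```

With these definitions, indexing "Apache Lucene in Action" and querying
"_x_Apache lucene" matches as a phrase, while "_x_apache lucene" does
not, because only the original casing is stored under the marker. The
index and query sides are deliberately different but compatible.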