From: Apurv Verma
Date: Tue, 25 Nov 2014 18:35:11 +0530
Subject: Re: Case Insensitive Matching in Solr/Lucene
To: solr-user@lucene.apache.org, java-user, msokolov@safaribooksonline.com, Ahmet Arslan

Hey Michael,

Thanks for your reply. My use case is a little different. I would like
to get the original values in facet queries, but I would like to apply
filter queries in a case insensitive fashion. For example, I require
facet_query to return

    Quick, The, brown, ...

But I want filter queries of the form

    fq=Term:"quick"

Also, could you please point me to some additional links on how I can
index different variants of a token at the same position?

--
Regards,
Apurv Verma

On Tue, Nov 25, 2014 at 6:26 PM, Michael Sokolov <msokolov@safaribooksonline.com> wrote:

> right -- missed Ahmet's answer there in my haste to respond ...
>
> -Mike
>
>
> On 11/25/14 6:56 AM, Ahmet Arslan wrote:
>
>> Hi Apurv,
>>
>> I wouldn't worry about index size, increase in index size is not linear
>> (2x) like that.
>> Please see a similar discussion:
>> https://issues.apache.org/jira/browse/LUCENE-5620
>>
>> Ahmet
>>
>>
>> On Tuesday, November 25, 2014 1:46 PM, Ahmet Arslan wrote:
>>
>> Hi Apurv,
>>
>> You can create an additional field for case sensitive search, and then
>> you can switch at query time. You will have two fields (text_ci and
>> text_lower) with different analysers, populated with copyField.
>>
>> Ahmet
>>
>>
>> On Tuesday, November 25, 2014 1:39 PM, Apurv Verma wrote:
>> Hey all,
>> The standard solution to doing a case-insensitive match in Lucene is to
>> use a Lowercase filter at index and query time. However, this does not
>> preserve the content of the original document. For example, suppose my
>> inverted index is:
>>
>> Term    | Doc_1 | Doc_2
>> -------------------------
>> Quick   |       |   X
>> The     |   X   |
>> brown   |   X   |   X
>> dog     |   X   |
>> dogs    |       |   X
>> fox     |   X   |
>> foxes   |       |   X
>> in      |       |   X
>> jumped  |   X   |
>> lazy    |   X   |   X
>> leap    |       |   X
>> over    |   X   |   X
>> quick   |   X   |
>> summer  |       |   X
>> the     |   X   |
>> -------------------------
>>
>> Is it possible to choose between a case insensitive and a case sensitive
>> match at query time? The index is stored in memory in Solr. My question
>> is: if this is stored as a hashmap with a string key, can I override the
>> hashcode so that "Quick" and "quick" return the same hash value?
>>
>> Has anyone attempted this before? Is my assumption about the index right?
>> What would be the classes and code flow to look at?
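
P.S. To make the two-field suggestion in the thread concrete, here is a
minimal sketch. It is a toy Python model, not Lucene or Solr code: the
names DOCS, build_index, filter_docs and facet_counts are hypothetical,
the two documents are an assumption chosen so that the exact-case index
reproduces the table in the original question, and plain whitespace
tokenization stands in for a real analyzer. It only illustrates the
idea of keeping one exact-case "field" for facets and one lowercased
"field" for filtering, with the choice made at query time.

```python
from collections import Counter, defaultdict

# Hypothetical toy documents (an assumption, chosen to match the term table).
DOCS = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_index(docs, normalize=lambda token: token):
    """Map each (optionally normalized) whitespace token to the doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


# Two "fields" built from the same source text, in the spirit of copyField:
# one keeps the original case, the other is lowercased at index time.
exact_index = build_index(DOCS)
lower_index = build_index(DOCS, normalize=str.lower)


def filter_docs(term, case_sensitive=False):
    """Return ids of docs matching term; the field is picked at query time."""
    if case_sensitive:
        return set(exact_index.get(term, set()))
    return set(lower_index.get(term.lower(), set()))


def facet_counts(doc_ids):
    """Count original-case terms over the given docs (one count per doc)."""
    counts = Counter()
    for term, ids in exact_index.items():
        hits = len(ids & doc_ids)
        if hits:
            counts[term] = hits
    return counts


matching = filter_docs("quick")   # case-insensitive filter on the lowercased field
facets = facet_counts(matching)   # facets still distinguish "Quick" from "quick"
```

In this model the filter never touches the exact-case index unless
case_sensitive=True is passed, so no hashcode override is needed: the
normalization happens once, when the lowercased field is built.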