From: Apurv Verma <apurv@bloomreach.com>
Date: Tue, 25 Nov 2014 17:05:37 +0530
Message-ID: 
 <CAH-3W3=eiEhMU=7J2CzP_xUfWuckdH2Gtz57z=cFnZgqtiT-BQ@mail.gmail.com>
Subject: Case Insensitive Matching in Solr/Lucene
To: java-user@lucene.apache.org, solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=001a113ee8008f33140508ad5041

--001a113ee8008f33140508ad5041
Content-Type: text/plain; charset=UTF-8

Hey all,
The standard way to get case-insensitive matching in Lucene is to apply a
lowercase filter at both index and query time. However, this discards the
original case of the terms, so the index no longer reflects the content of
the original document. For example, suppose my inverted index is:

Term    | Doc_1 | Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
-------------------------
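
To make the table concrete, here is a minimal sketch in plain Java (no
Lucene classes) that builds a toy inverted index twice: once keeping the
original case and once lowercasing every token. The class name
ToyInvertedIndex, the method build, the whitespace tokenization, and the two
document texts are all assumptions for illustration; the texts are simply
one pair of sentences that yields the terms in the table.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy model of an inverted index; not Lucene's data structure.
class ToyInvertedIndex {

    // Maps each term to the sorted set of 1-based document ids containing it.
    static Map<String, Set<Integer>> build(String[] docs, boolean lowercase) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int i = 0; i < docs.length; i++) {
            // Assumption: tokens are separated by whitespace only.
            for (String token : docs[i].split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                String term = lowercase ? token.toLowerCase(Locale.ROOT) : token;
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(i + 1);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Assumed document texts, reconstructed from the terms in the table.
        String[] docs = {
            "The quick brown fox jumped over the lazy dog",
            "Quick brown foxes leap over lazy dogs in summer"
        };
        System.out.println("case preserved: " + build(docs, false));
        System.out.println("lowercased:     " + build(docs, true));
    }
}
```

With case preserved, the index holds the 15 terms of the table, and "Quick"
and "quick" point to different documents. With lowercasing, "Quick"/"quick"
and "The"/"the" collapse into one entry each, leaving 13 terms, and the
information about which document used which casing is gone.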

Is it possible to choose between case-insensitive and case-sensitive
matching at query time? The index is held in memory in Solr. My question is:
if it is stored as a hash map with string keys, can I override the hash code
so that "Quick" and "quick" return the same hash value?
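
The idea can be sketched on a plain java.util.HashMap. This is only a model
of the question, under the assumption that the term dictionary behaves like
a hash map; it makes no claim about how Lucene or Solr actually store terms.
Since java.lang.String is final, its hashCode cannot be overridden, so the
sketch wraps the term in a hypothetical key class (CaseInsensitiveKey) whose
hashCode and equals both work on the lowercased form. The class
DualCaseIndex and its methods add and lookup are likewise made-up names.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index that keeps an exact-case view and a case-folded view.
class DualCaseIndex {

    // Hypothetical key wrapper: hashes and compares on the lowercased term.
    static final class CaseInsensitiveKey {
        private final String folded;

        CaseInsensitiveKey(String term) {
            this.folded = term.toLowerCase(Locale.ROOT);
        }

        @Override
        public int hashCode() {
            return folded.hashCode();
        }

        // equals must agree with hashCode; equal hashes alone are not
        // enough for HashMap to treat two keys as the same entry.
        @Override
        public boolean equals(Object other) {
            return other instanceof CaseInsensitiveKey
                    && folded.equals(((CaseInsensitiveKey) other).folded);
        }
    }

    private final Map<String, Set<Integer>> exact = new HashMap<>();
    private final Map<CaseInsensitiveKey, Set<Integer>> folded = new HashMap<>();

    // Records the term under both views.
    void add(String term, int docId) {
        exact.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        folded.computeIfAbsent(new CaseInsensitiveKey(term), k -> new TreeSet<>()).add(docId);
    }

    // The caller picks the matching mode per query.
    Set<Integer> lookup(String term, boolean caseSensitive) {
        Set<Integer> hits = caseSensitive
                ? exact.get(term)
                : folded.get(new CaseInsensitiveKey(term));
        return hits == null ? new TreeSet<>() : hits;
    }

    public static void main(String[] args) {
        DualCaseIndex index = new DualCaseIndex();
        index.add("Quick", 2);
        index.add("quick", 1);
        System.out.println(index.lookup("quick", true));   // prints [1]
        System.out.println(index.lookup("QUICK", false));  // prints [1, 2]
    }
}
```

A map keyed only by the wrapper would always match case-insensitively, which
is why the sketch keeps the exact-case map alongside it: the choice between
the two modes is then made per lookup, at the cost of storing each term
twice.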

Has anyone attempted this before? Is my assumption about the index right?
Which classes and code flow should I look at?

-- 
Regards,
Apurv

--001a113ee8008f33140508ad5041--