From: "Andre Rubin" <andre.rubin@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Case Sensitivity
Date: Thu, 14 Aug 2008 15:16:27 -0700

Sergey,

Based on a recent discussion I started:
http://www.nabble.com/Searching-Tokenized-x-Un_tokenized-td18882569.html
you cannot use UN_TOKENIZED, because no analyzer is ever run over such a
field. My suggestion: use a tokenized field and a custom-made Analyzer. I
haven't figured out all the details for you, but I think it's possible.

Andre

On Thu, Aug 14, 2008 at 8:17 AM, Erick Erickson wrote:

> Be aware that StandardAnalyzer lowercases all the input,
> both at index and query time. Field.Store.YES will store
> the original text without any transformations, so doc.get()
> will return the original text. However, no matter what the
> Field.Store value, the *indexed* tokens (with
> Field.Index.TOKENIZED, as in your example)
> are passed through the analyzer.
>
> For instance, indexing "MIXed CasE TEXT" in a
> field called "myfield" with Field.Store.YES and
> Field.Index.TOKENIZED would index the
> following tokens (with StandardAnalyzer):
> mixed
> case
> text
>
> and searches (with StandardAnalyzer) would match
> any case in the query terms (e.g. MIXED would hit,
> as would mixed, as would CaSE).
>
> However, doc.get("myfield") would return
> "MIXed CasE TEXT".
>
> As Doron said, though, a few use cases would
> help us provide better answers.
>
> Best
> Erick
>
> On Thu, Aug 14, 2008 at 10:31 AM, Sergey Kabashnyuk wrote:
>
>> Thanks for your reply, Erick.
>>
>>> About the only way to do this that I know of is to
>>> index the data three times: once without any case
>>> changes, once uppercased, and once lowercased.
>>> You'll have to watch your analyzer, probably making
>>> up your own (easily done; see the synonym analyzer
>>> in Lucene in Action).
>>>
>>> Your example doesn't tell us anything, since the critical
>>> information is the *analyzer* you use, both at query and
>>> at index time. The analyzer is responsible for any
>>> transformations, like case folding, tokenizing, etc.
>>
>> In the example I wanted to show that I stored the field with
>> Field.Index.NO_NORMS.
>>
>> As I understand it, this means the field contains the original string,
>> regardless of which analyzer I chose (StandardAnalyzer by default).
>>
>> I build all queries myself, without using parsers.
>> For example: new TermQuery(new Term("field", "MaMa"));
>>
>> I agree with you about the possible implementation,
>> but it increases the size of the index several times over.
>>
>> Are there other possibilities, such as a custom query, perhaps
>> similar to RegexQuery/RegexTermEnum, that would compare terms
>> at its own discretion?
>>
>>> But what is your use case for needing both upper- and
>>> lowercase comparisons? I have a hard time coming
>>> up with a reason to do both that wouldn't be satisfied
>>> by just a caseless search.
>>>
>>> Best
>>> Erick
>>>
>>> On Thu, Aug 14, 2008 at 4:47 AM, Sergey Kabashnyuk wrote:
>>>
>>>> Hello.
>>>>
>>>> I have a similar question.
>>>>
>>>> I need to implement:
>>>> 1. Case-sensitive search.
>>>> 2.
Lowercase search for a specific field.
>>>> 3. Uppercase search for a specific field.
>>>>
>>>> For now I use
>>>>
>>>> new Field("PROPERTIES",
>>>>           content,
>>>>           Field.Store.NO,
>>>>           Field.Index.NO_NORMS,
>>>>           Field.TermVector.NO)
>>>>
>>>> for the original string, which gives me the case-sensitive search.
>>>>
>>>> But does anyone have an idea of how to implement the second and
>>>> third types of search?
>>>>
>>>> Thanks
>>>>
>>>>> Hi All,
>>>>>
>>>>> Once I index a bunch of documents with a StandardAnalyzer (and if
>>>>> the effort I need to put in to reindex the documents is not worth
>>>>> it), is there a way to search the index without case sensitivity?
>>>>> I do not use any sophisticated Analyzer that makes use of
>>>>> LowerCaseTokenizer.
>>>>> Please let me know if there is a solution to circumvent this case
>>>>> sensitivity problem.
>>>>>
>>>>> Many thanks
>>>>> Dino
>>>>
>>>> --
>>>> Sergey Kabashnyuk
>>>> eXo Platform SAS
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
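[Editor's note] Erick's suggestion above (index the same value more than once, once as-is for exact matching and once case-folded for caseless matching) can be sketched without Lucene at all. The class below is a deliberately tiny, stdlib-only stand-in: the `CaseFields` name, the map-based "index", and the `PROPERTIES`/`PROPERTIES_LC` field names are all invented for illustration and are not real Lucene API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Stdlib-only sketch of the "index the value twice" idea from the thread:
// an exact-case posting map supports case-sensitive TermQuery-style lookups,
// while a lowercased shadow map supports caseless lookups. In real Lucene
// these would be two fields on the same Document (e.g. one NO_NORMS field
// with the original term, one field fed through a lowercasing analyzer).
public class CaseFields {
    private final Map<String, Set<String>> exact = new HashMap<>();   // "PROPERTIES"
    private final Map<String, Set<String>> folded = new HashMap<>();  // "PROPERTIES_LC"

    // Add one term of one document to both "fields".
    public void index(String docId, String term) {
        exact.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
        folded.computeIfAbsent(term.toLowerCase(Locale.ROOT),
                               k -> new HashSet<>()).add(docId);
    }

    // Case-sensitive search: the query term is used verbatim.
    public Set<String> searchExact(String term) {
        return exact.getOrDefault(term, Set.of());
    }

    // Caseless search: the query term is folded the same way the
    // shadow field was folded at index time.
    public Set<String> searchCaseless(String term) {
        return folded.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of());
    }
}
```

The key invariant, as in real Lucene, is that the query term must receive the same transformation the field received at index time: `searchCaseless` folds its argument exactly as `index` did, while `searchExact` applies none. The cost is the one Sergey objects to in the thread: each extra variant stored grows the index.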