lucene-java-user mailing list archives

From "Doron Cohen" <cdor...@gmail.com>
Subject Re: Case Sensitivity
Date Sat, 16 Aug 2008 20:00:45 GMT
Hi Sergey, it seems that cases 4 and 5 are equivalent,
both meaning case-insensitive search, right? Otherwise, please
explain the difference.

If you need to support both case-sensitive search
(cases 1, 2, 3) and case-insensitive search (cases 4/5), then
both forms must be saved in the index, in two separate
fields (as Erick mentioned, I think).
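
To make the two-field idea concrete, here is a minimal sketch in plain Java. It is a toy in-memory index, not the Lucene API: the class name, the method names and the "text_lc" field name are made up for illustration. It only shows the principle that the exact form and a lowercased form are indexed side by side, and that the query is folded the same way before it is looked up in the lowercased field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/**
 * Toy in-memory index (hypothetical, NOT the Lucene API) that keeps each
 * untokenized value in two fields: "text" with the original case and
 * "text_lc" with the lowercased form.
 */
class TwoFieldIndexSketch {
    // field name -> (term -> ids of the documents that contain the term)
    private final Map<String, Map<String, List<String>>> postings = new HashMap<>();

    /** Indexes one value under both the exact field and the lowercased field. */
    public void add(String docId, String value) {
        put("text", value, docId);
        put("text_lc", value.toLowerCase(Locale.ROOT), docId);
    }

    /** Case-sensitive lookup: the term must match the original string exactly. */
    public List<String> exact(String value) {
        return lookup("text", value);
    }

    /** Case-insensitive lookup: the query is lowercased like the indexed form. */
    public List<String> ignoreCase(String value) {
        return lookup("text_lc", value.toLowerCase(Locale.ROOT));
    }

    private void put(String field, String term, String docId) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(term, t -> new ArrayList<>())
                .add(docId);
    }

    private List<String> lookup(String field, String term) {
        Map<String, List<String>> terms = postings.get(field);
        if (terms == null) {
            return Collections.emptyList();
        }
        List<String> docs = terms.get(term);
        return docs == null ? Collections.<String>emptyList() : docs;
    }
}
```

With your three documents added in order, exact("field without norms") returns only doc2, while ignoreCase("FIELD WITHOUT NORMS") returns doc1, doc2 and doc3. Since one folded field serves both the lowercase and the uppercase query, a separate uppercased copy should not be needed.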

Hope this helps,
Doron

On Fri, Aug 15, 2008 at 10:51 AM, Sergey Kabashnyuk <ksmmlist@gmail.com> wrote:

> Hello
>
> Here's my use case:
>
>         Field     Content of the field
> Doc1    "text"    "Field Without Norms"
> Doc2    "text"    "field without norms"
> Doc3    "text"    "FIELD WITHOUT NORMS"
>
>
>
>
> Query                                           expected result
> 1. new Term("text","Field Without Norms")       doc1
> 2. new Term("text","field without norms")       doc2
> 3. new Term("text","FIELD WITHOUT NORMS")       doc3
> 4. lowercase("text","field without norms")      doc1, doc2, doc3
> 5. uppercase("text","FIELD WITHOUT NORMS")      doc1, doc2, doc3
>
> I store the "text" field like this:
> new Field("text", content, Field.Store.NO, Field.Index.NO_NORMS,
>           Field.TermVector.NO)
> using StandardAnalyzer, and queries 1-3 work exactly as I need. The
> question is how to create queries 4-5.
>
> Thanks
>
> Sergey Kabashnyuk
> eXo Platform SAS
>
>
>> Be aware that StandardAnalyzer lowercases all the input,
>> both at index and query times. Field.Store.YES will store
>> the original text without any transformations, so doc.get(<field>)
>> will return the original text. However, no matter what the
>> Field.Store value, the *indexed* tokens (when you use
>> Field.Index.TOKENIZED) are passed through the analyzer.
>>
>> For instance, indexing "MIXed CasE  TEXT" in a
>> field called "myfield" with Field.Store.YES,
>> Field.Index.TOKENIZED would index the
>> following tokens (with StandardAnalyzer).
>> mixed
>> case
>> text
>>
>> and searches (with StandardAnalyzer) would match
>> any case in the query terms (e.g. MIXED would hit,
>> as would mixed as would CaSE).
>>
>> However, doc.get("myfield") would return
>> "MIXed CasE  TEXT"
>>
>> As Doron said, though, a few use cases would
>> help us provide better answers.
>>
>> Best
>> Erick
>>
>>
>> On Thu, Aug 14, 2008 at 10:31 AM, Sergey Kabashnyuk <ksmmlist@gmail.com> wrote:
>>
>>> Thanks for your reply, Erick.
>>>
>>>> About the only way to do this that I know of is to
>>>> index the data three times, once without any case
>>>> changing, once uppercased and once lowercased.
>>>> You'll have to watch your analyzer, probably making
>>>> up your own (easily done, see the synonym analyzer
>>>> in Lucene in Action).
>>>>
>>>> Your example doesn't tell us anything, since the critical
>>>> information is the *analyzer* you use, both at query and
>>>> at index times. The analyzer is responsible for any
>>>> transformations, like case folding, tokenizing, etc.
>>>>
>>>>
>>>
>>> In my example I wanted to show that I indexed the field with Field.Index.NO_NORMS.
>>>
>>> As I understand it, this means that the field contains the original string
>>> regardless of which analyzer I choose (StandardAnalyzer by default).
>>>
>>> I build all queries myself, without using parsers.
>>> For example: new TermQuery(new Term("field", "MaMa"));
>>>
>>>
>>> I agree with you about the possible implementation,
>>> but it increases the size of the index several times over.
>>>
>>> Are there other possibilities, such as using a custom query, possibly
>>> similar to RegexQuery/RegexTermEnum, that would compare terms
>>> at its own discretion?
>>>
>>>
>>>
>>>
>>>
>>>> But what is your use case for needing both upper and
>>>> lower case comparisons? I have a hard time coming
>>>> up with a reason to do both that wouldn't be satisfied
>>>> by just a caseless search.
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Thu, Aug 14, 2008 at 4:47 AM, Sergey Kabashnyuk <ksmmlist@gmail.com> wrote:
>>>>
>>>>> Hello.
>>>>>
>>>>> I have a similar question.
>>>>>
>>>>> I need to implement
>>>>> 1. Case sensitive search.
>>>>> 2. Lower case search for a specific field.
>>>>> 3. Upper case search for a specific field.
>>>>>
>>>>> For now I use
>>>>> new Field("PROPERTIES",
>>>>>                 content,
>>>>>                 Field.Store.NO,
>>>>>                 Field.Index.NO_NORMS,
>>>>>                 Field.TermVector.NO)
>>>>> for original string and make case sensitive search.
>>>>>
>>>>> But does anyone have an idea how to implement the second and third types of
>>>>> search?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Once I index a bunch of documents with a StandardAnalyzer (and if
>>>>>> reindexing the documents is not worth the effort), is there
>>>>>> a way to search on the index without case sensitivity?
>>>>>> I do not use any sophisticated Analyzer that makes use of
>>>>>> LowerCaseTokenizer.
>>>>>> Please let me know if there is a solution to circumvent this case
>>>>>> sensitivity problem.
>>>>>> Many thanks
>>>>>> Dino
>>>>>>
>>>>>>
>>>>> --
>>>>> Sergey Kabashnyuk
>>>>> eXo Platform SAS
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>> --
>>> Sergey Kabashnyuk
>>> eXo Platform SAS