lucene-java-user mailing list archives

From "Dino Korah" <dcko...@gmail.com>
Subject RE: Case Sensitivity
Date Tue, 19 Aug 2008 12:57:56 GMT
Hi Guys,

From the discussion here, what I understand is that if I use
StandardAnalyzer on TOKENIZED fields, for both indexing and querying, I
shouldn't have any problems with case. But if I have any UN_TOKENIZED
fields, there will be problems unless I case-normalize the values myself
before adding them as fields to the document.

In my case I have a mixed scenario. I am indexing emails, and the email
addresses are indexed UN_TOKENIZED. I also have a second set of custom
tokenized fields, which keep the tokens in individual fields with the same name.

For example, if the email had a from address "John Smith"
<J.Smith@world.net>, my document looks like this

------------------8<----------------
to: ...                       - UN_TOKENIZED
from: J.Smith@world.net       - UN_TOKENIZED
From-tokenized: John          - UN_TOKENIZED
From-tokenized: Smith         - UN_TOKENIZED
From-tokenized: J             - UN_TOKENIZED
From-tokenized: Smith         - UN_TOKENIZED
From-tokenized: world.net     - UN_TOKENIZED
From-tokenized: world         - UN_TOKENIZED
From-tokenized: net           - UN_TOKENIZED
Subject: ...                  - TOKENIZED
Body: ...                     - TOKENIZED
------------------8<----------------

Does this mean that wherever I use UN_TOKENIZED, the values do not go through
the StandardAnalyzer before being indexed, but they do when they are searched
on? If that is the case, do I need to normalise them before adding them to
the document?
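
To make the question concrete, here is a minimal sketch in plain Java
rather than Lucene itself. ExactTermField and normalize are hypothetical
names, and the Set only stands in for the way an un-analyzed field is, to
my understanding, indexed as a single verbatim term. It assumes the same
lowercasing is applied by hand at index time and at query time:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Stand-in for an un-analyzed (UN_TOKENIZED) field: terms are stored and
// looked up verbatim, so matching is exact and case-sensitive.
final class ExactTermField {
    private final Set<String> terms = new HashSet<>();

    // Hypothetical helper: the case normalization applied on both sides.
    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Index the value exactly as given (no normalization).
    void addRaw(String value) {
        terms.add(value);
    }

    // Index the value after case normalization.
    void addNormalized(String value) {
        terms.add(normalize(value));
    }

    // Exact term lookup, comparable to a hand-built term query.
    boolean containsTerm(String term) {
        return terms.contains(term);
    }
}
```

With only addRaw("J.Smith@world.net"), a lookup for "j.smith@world.net"
misses. After addNormalized, any casing of the address matches as long as
the query term is passed through normalize first.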

I would also like to know whether it is better to employ an EmailAnalyzer that
makes a TokenStream out of the given email address, rather than using a
simplistic function that gives me a list of string pieces and adding them
one by one. For searches, would both approaches give the same results?
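
For reference, the "simplistic function" approach could look roughly like
the sketch below. EmailPieces and pieces are hypothetical names, the
splitting rules are an assumption based on the example document above, and
the lowercasing is an added assumption so the pieces survive case-insensitive
matching. If an analyzer emitted the same terms, a plain term lookup should
find the same documents either way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper that breaks a display name and an email address into
// the lowercased pieces that would each be added as a separate field value.
final class EmailPieces {

    static List<String> pieces(String displayName, String address) {
        List<String> out = new ArrayList<>();
        // Words of the display name, e.g. "John Smith" -> john, smith
        addSplit(out, displayName, "\\s+");

        int at = address.lastIndexOf('@');
        String local = at < 0 ? address : address.substring(0, at);
        String domain = at < 0 ? "" : address.substring(at + 1);

        // Dot-separated parts of the local part, e.g. "J.Smith" -> j, smith
        addSplit(out, local, "\\.");

        if (!domain.isEmpty()) {
            // The whole domain, then its labels when it has more than one
            out.add(domain.toLowerCase(Locale.ROOT));
            if (domain.indexOf('.') >= 0) {
                addSplit(out, domain, "\\.");
            }
        }
        return out;
    }

    // Splits text on the given regex and appends each non-empty part, lowercased.
    private static void addSplit(List<String> out, String text, String regex) {
        for (String part : text.trim().split(regex)) {
            if (!part.isEmpty()) {
                out.add(part.toLowerCase(Locale.ROOT));
            }
        }
    }
}
```

For the example address, pieces("John Smith", "J.Smith@world.net") yields
john, smith, j, smith, world.net, world, net.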

Many thanks,
Dino



-----Original Message-----
From: Doron Cohen [mailto:cdoronc@gmail.com] 
Sent: 16 August 2008 21:01
To: java-user@lucene.apache.org
Subject: Re: Case Sensitivity

Hi Sergey, it seems that cases 4 and 5 are equivalent, both meaning case
insensitive, right? Otherwise please explain the difference.

If it is required to support both case sensitive (cases 1,2,3) and case
insensitive (case 4/5) then both forms must be saved in the index - in two
separate fields (as Erick mentioned, I think).
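
The two-field idea can be illustrated with a small in-memory sketch in
plain Java, not Lucene code. DualCaseIndex and the field names "text" and
"text_lc" are made up for the illustration: one field keeps the original
string for case sensitive lookups, the other keeps a lowercased copy for
case insensitive ones.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy index that stores every value twice: verbatim and lowercased.
final class DualCaseIndex {
    // field name -> term -> ids of the documents containing that term
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    void add(int docId, String value) {
        put("text", value, docId);                              // exact form
        put("text_lc", value.toLowerCase(Locale.ROOT), docId);  // folded form
    }

    // Case sensitive search: only the exact field is consulted.
    Set<Integer> searchExact(String value) {
        return lookup("text", value);
    }

    // Case insensitive search: the query is folded the same way as the data.
    Set<Integer> searchIgnoreCase(String value) {
        return lookup("text_lc", value.toLowerCase(Locale.ROOT));
    }

    private void put(String field, String term, int docId) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(term, t -> new TreeSet<>())
                .add(docId);
    }

    private Set<Integer> lookup(String field, String term) {
        return postings.getOrDefault(field, Map.of()).getOrDefault(term, Set.of());
    }
}
```

With Sergey's three documents added, searchExact returns a single document
for each exact casing, while searchIgnoreCase returns all three for any
casing. The cost is the extra field in the index.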

Hope this helps,
Doron

On Fri, Aug 15, 2008 at 10:51 AM, Sergey Kabashnyuk
<ksmmlist@gmail.com> wrote:

> Hello
>
> Here's my use case           content of the field
> Doc1 -
>        Field - "text " -   "Field Without Norms"
>
> Doc2 -
>        Field - "text " -   "field without norms"
>
> Doc3 -
>        Field - "text " -   "FIELD WITHOUT NORMS"
>
>
> Query                                     expected result
> 1. new Term("text","Field Without Norms")       doc1
> 2. new Term("text","field without norms")       doc2
> 3. new Term("text","FIELD WITHOUT NORMS")       doc3
> 4. lowercase("text","field without norms")   doc1, doc2, doc3
> 5. uppercase("text","FIELD WITHOUT NORMS")   doc1, doc2, doc3
>
> I store the "text" field like this:
> new Field("text", content, Field.Store.NO,
>           Field.Index.NO_NORMS, Field.TermVector.NO)
> using StandardAnalyzer, and queries 1-3 work exactly as I need. The
> question is how to create queries 4-5.
>
> Thanks
>
> Sergey Kabashnyuk
> eXo Platform SAS
>
>
>> Be aware that StandardAnalyzer lowercases all the input,
>> both at index and query times. Field.Store.YES will store the
>> original text without any transformations, so doc.get(<field>) will
>> return the original text. However, no matter what the Field.Store
>> value, the *indexed* tokens (using TOKENIZED as your
>> Field.Index value) are passed through the analyzer.
>>
>> For instance, indexing "MIXed CasE  TEXT" in a field called "myfield" 
>> with Field.Store.YES, Field.Index.TOKENIZED would index the following 
>> tokens (with StandardAnalyzer).
>> mixed
>> case
>> text
>>
>> and searches (with StandardAnalyzer) would match any case in the 
>> query terms (e.g. MIXED would hit, as would mixed as would CaSE).
>>
>> However, doc.get("myfield") would return "MIXed CasE  TEXT"
>>
>> As Doron said, though, a few use cases would help us provide better 
>> answers.
>>
>> Best
>> Erick
>>
>>
>> On Thu, Aug 14, 2008 at 10:31 AM, Sergey Kabashnyuk
>> <ksmmlist@gmail.com> wrote:
>>
>>> Thanks for your reply, Erick.
>>>
>>>> About the only way to do this that I know of is to
>>>> index the data three times, once without any case changing, once 
>>>> uppercased and once lowercased.
>>>> You'll have to watch your analyzer, probably making up your own 
>>>> (easily done, see the synonym analyzer in Lucene in Action).
>>>>
>>>> Your example doesn't tell us anything, since the critical 
>>>> information is the *analyzer* you use, both at query and at index 
>>>> times. The analyzer is responsible for any transformations, like 
>>>> case folding, tokenizing, etc.
>>>>
>>>>
>>>
>>> In the example I wanted to show that I stored the field as
>>> Field.Index.NO_NORMS.
>>>
>>> As I understand it, this means that the field contains the original
>>> string regardless of which analyzer I choose (StandardAnalyzer by default).
>>>
>>> I build all queries myself, without using parsers.
>>> For example: new TermQuery(new Term("field", "MaMa"));
>>>
>>>
>>> I agree with you about the possible implementation, but it increases the
>>> size of the index several times over.
>>>
>>> But are there other possibilities, such as using a custom query,
>>> possibly similar to RegexQuery/RegexTermEnum, that would compare
>>> terms at its own discretion?
>>>
>>>
>>>
>>>
>>>
>>>> But what is your use-case for needing both upper and
>>>> lower case comparisons? I have a hard time coming up with a reason 
>>>> to do both that wouldn't be satisfied by just a caseless search.
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Thu, Aug 14, 2008 at 4:47 AM, Sergey Kabashnyuk
>>>> <ksmmlist@gmail.com> wrote:
>>>>
>>>>> Hello.
>>>>>
>>>>> I have a similar question.
>>>>>
>>>>> I need to implement
>>>>> 1. Case sensitive search.
>>>>> 2. Lower case search for a concrete field.
>>>>> 3. Upper case search for a concrete field.
>>>>>
>>>>> For now I use
>>>>> new Field("PROPERTIES",
>>>>>                 content,
>>>>>                 Field.Store.NO,
>>>>>                 Field.Index.NO_NORMS,
>>>>>                 Field.TermVector.NO)
>>>>> for the original string and do a case sensitive search.
>>>>>
>>>>> But does anyone have an idea how to implement the second and third
>>>>> types of search?
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> Once I index a bunch of documents with a StandardAnalyzer (and if the
>>>>>> effort I need to put in to reindex the documents is not worth it),
>>>>>> is there a way to search on the index without case sensitivity?
>>>>>> I do not use any sophisticated Analyzer that makes use of
>>>>>> LowerCaseTokenizer.
>>>>>> Please let me know if there is a solution to circumvent this case
>>>>>> sensitivity problem.
>>>>>> Many thanks
>>>>>> Dino
>>>>>>
>>>>>>
>>>>> --
>>>>> Sergey Kabashnyuk
>>>>> eXo Platform SAS
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>> --
>>> Sergey Kabashnyuk
>>> eXo Platform SAS
>>>
>>>
>>>
>>>
>
>
>
>
>
>



