lucene-java-user mailing list archives

From "Andre Rubin" <andre.ru...@gmail.com>
Subject Re: Case Sensitivity
Date Thu, 14 Aug 2008 22:16:27 GMT
Sergey,

Based on a recent discussion I started
(http://www.nabble.com/Searching-Tokenized-x-Un_tokenized-td18882569.html),
you cannot use UN_TOKENIZED, because no analyzer is ever run over an
un-tokenized field.
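
To make that concrete, here is a minimal plain-Java model of an
un-tokenized field. It does not use Lucene at all; the class and method
names are made up for illustration. The whole value becomes a single
term, so an exact term lookup only matches the exact case:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical plain-Java model (not Lucene code) of an un-tokenized field:
// the whole value is indexed as one term and no analyzer touches it.
class UntokenizedFieldModel {
    // term -> ids of the documents that contain exactly that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index the value verbatim as a single term.
    void add(int docId, String value) {
        postings.computeIfAbsent(value, k -> new HashSet<>()).add(docId);
    }

    // Exact term lookup, comparable in spirit to a TermQuery.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term, new HashSet<>());
    }

    public static void main(String[] args) {
        UntokenizedFieldModel field = new UntokenizedFieldModel();
        field.add(1, "MaMa");
        System.out.println(field.search("MaMa")); // finds document 1
        System.out.println(field.search("mama")); // finds nothing
    }
}
```

Since nothing sits between the raw value and the index, there is no
place to fold case for such a field.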

My suggestion: use a tokenized field and a custom-made Analyzer.
I haven't figured out all the details for you, but I think it's possible.
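
A rough sketch of the idea, again in plain Java rather than Lucene, so
every name below is hypothetical. It assumes three index fields
("exact", "lower", "upper"), each with its own case rule, and it
assumes the query goes through the same rule as the field it targets.
In Lucene the "lower" rule would roughly correspond to a
WhitespaceTokenizer wrapped in a LowerCaseFilter; the "upper" rule
would probably need a custom filter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical model of a per-field analyzer: whitespace tokenizing plus a case rule.
class CaseVariantIndexModel {
    // field name -> case rule applied to every token of that field
    private final Map<String, UnaryOperator<String>> caseRules = new HashMap<>();
    // field name -> term -> ids of the documents containing the term
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    CaseVariantIndexModel() {
        caseRules.put("exact", s -> s);
        caseRules.put("lower", s -> s.toLowerCase(Locale.ROOT));
        caseRules.put("upper", s -> s.toUpperCase(Locale.ROOT));
    }

    // Split on whitespace, then apply the field's case rule to each token.
    List<String> analyze(String field, String text) {
        UnaryOperator<String> rule = caseRules.getOrDefault(field, UnaryOperator.identity());
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) tokens.add(rule.apply(raw));
        }
        return tokens;
    }

    // Index the same text once per field, each time with that field's rule.
    void add(int docId, String text) {
        for (String field : caseRules.keySet()) {
            Map<String, Set<Integer>> terms = postings.computeIfAbsent(field, k -> new HashMap<>());
            for (String token : analyze(field, text)) {
                terms.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
            }
        }
    }

    // Analyze the query with the target field's rule, then look up each token.
    Set<Integer> search(String field, String query) {
        Set<Integer> hits = new HashSet<>();
        Map<String, Set<Integer>> terms = postings.getOrDefault(field, new HashMap<>());
        for (String token : analyze(field, query)) {
            hits.addAll(terms.getOrDefault(token, new HashSet<>()));
        }
        return hits;
    }

    public static void main(String[] args) {
        CaseVariantIndexModel index = new CaseVariantIndexModel();
        index.add(1, "MIXed CasE  TEXT");
        System.out.println(index.search("exact", "mixed")); // no hit: case differs
        System.out.println(index.search("lower", "MIXED")); // hit: both sides lowercased
    }
}
```

The trade-off is the one already raised in this thread: every extra
field repeats the same terms, so the index grows accordingly.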

Andre

On Thu, Aug 14, 2008 at 8:17 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> Be aware that StandardAnalyzer lowercases all the input,
> both at index and query times. Field.Store.YES will store
> the original text without any transformations, so doc.get(<field>)
> will return the original text. However, no matter what the
> Field.Store value, the *indexed* tokens (using
> TOKENIZED as your Field.Index value)
> are passed through the analyzer.
>
> For instance, indexing "MIXed CasE  TEXT" in a
> field called "myfield" with Field.Store.YES,
> Field.Index.TOKENIZED would index the
> following tokens (with StandardAnalyzer).
> mixed
> case
> text
>
> and searches (with StandardAnalyzer) would match
> any case in the query terms (e.g. MIXED would hit,
> as would mixed as would CaSE).
>
> However, doc.get("myfield") would return
> "MIXed CasE  TEXT"
>
> As Doron said, though, a few use cases would
> help us provide better answers.
>
> Best
> Erick
>
>
> On Thu, Aug 14, 2008 at 10:31 AM, Sergey Kabashnyuk <ksmmlist@gmail.com> wrote:
>
>> Thanks for your reply, Erick.
>>
>>
>>> About the only way to do this that I know of is to
>>> index the data three times, once without any case
>>> changing, once uppercased and once lowercased.
>>> You'll have to watch your analyzer, probably making
>>> up your own (easily done, see the synonym analyzer
>>> in Lucene in Action).
>>>
>>> Your example doesn't tell us anything, since the critical
>>> information is the *analyzer* you use, both at query and
>>> at index times. The analyzer is responsible for any
>>> transformations, like case folding, tokenizing, etc.
>>>
>>
>>
>> In my example I wanted to show that I index the field as
>> Field.Index.NO_NORMS.
>>
>> As I understand it, this means that the field contains the original
>> string regardless of which analyzer I choose (StandardAnalyzer by default).
>>
>> I build all queries myself, without using parsers.
>> For example: new TermQuery(new Term("filed", "MaMa"));
>>
>>
>> I agree with you about the possible implementation,
>> but it increases the size of the index several times over.
>>
>> But are there other possibilities, such as using a custom query, possibly
>> similar to RegexQuery/RegexTermEnum, that would compare terms
>> at its own discretion?
>>
>>
>>
>>
>>
>>> But what is your use-case for needing both upper and
>>> lower case comparisons? I have a hard time coming
>>> up with a reason to do both that wouldn't be satisfied
>>> by just a caseless search.
>>>
>>> Best
>>> Erick
>>>
>>> On Thu, Aug 14, 2008 at 4:47 AM, Sergey Kabashnyuk <ksmmlist@gmail.com> wrote:
>>>
>>>> Hello.
>>>>
>>>> I have a similar question.
>>>>
>>>> I need to implement
>>>> 1. Case-sensitive search.
>>>> 2. Lowercase search for a specific field.
>>>> 3. Uppercase search for a specific field.
>>>>
>>>> For now I use
>>>> new Field("PROPERTIES",
>>>>                  content,
>>>>                  Field.Store.NO,
>>>>                  Field.Index.NO_NORMS,
>>>>                  Field.TermVector.NO)
>>>> for the original string, and run a case-sensitive search on it.
>>>>
>>>> But does anyone have an idea how to implement the second and third
>>>> types of search?
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>> Hi All,
>>>>>
>>>>> Once I have indexed a bunch of documents with a StandardAnalyzer (and
>>>>> reindexing the documents is not worth the effort), is there a way
>>>>> to search the index without case sensitivity?
>>>>> I do not use any sophisticated Analyzer that makes use of
>>>>> LowerCaseTokenizer.
>>>>> Please let me know if there is a solution to circumvent this case
>>>>> sensitivity problem.
>>>>> Many thanks
>>>>> Dino
>>>>>
>>>>>
>>>> --
>>>> Sergey Kabashnyuk
>>>> eXo Platform SAS
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>> --
>> Sergey Kabashnyuk
>> eXo Platform SAS
>>
>>
>>
>


