lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: case insensitivity
Date Wed, 25 Jun 2008 16:01:37 GMT
I suppose something like that might work, but I still think that presenting
a user with matches that sometimes work case sensitive and sometimes
doesn't would be...er..fraught.


If you can programmatically restrict your query construction and you're
*sure* this is what your users expect, you can make it work. Just index
each term twice, once lowercased and once in the native case, with 0
term increment between them. Then you can simply construct your
terms however you want and fire the result at the search. In fact, you
only need to double-index the terms you want to do case-sensitive
searches on. This will increase the size of your index less than you
think...

Best
Erick

On Wed, Jun 25, 2008 at 9:59 AM, John Byrne <john.byrne@propylon.com> wrote:

> What I had in mind was actually very simple: when you create a Term
> (programatically) you normally set the text and the field. I would also like
> to be able to set the case sensitivity to true or false for that specific
> Term object.
>
> I imangined (and maybe I am over simplifying it!) that somewhere in the API
> there must be a string comparison using 'String.equals()' that determines if
> a document contains the term or not - and that use of 'equals()' has
> permanently locked Lucene into case-sensitive searching. The values being
> compared could be first lower-cased (or equalsIgnoreCase could be used)
> depending on the value of a boolean flag in the Term object.
>
> If that option was there, there would be no need to ever change the case in
> the analyzer - you'd be able to control case-sensitivity regardless of the
> field used.
>
> Of course, I realize that there is currently no way to take advantage of
> such a feature in the QueryParser. It could only be done programatically.
> But I don't think that's a reason not to do it, since the API already has
> features that aren't implemented in the QueryParser (like SpanQuerys). In a
> perfect world, the parser would support all the features, but for the time
> being anyone who wants to take advantage of the newer features has to find
> an alternative anyway.
>
> The problem that it would solve for me is, as I mentioned, that I could mix
> case-sensitive Terms with case-insensitive Terms when using SpanQuerys. I
> currently have no way to do that.
>
> Regards,
> -John
>
> Erick Erickson wrote:
>
>> Well, it depends on what you mean by "per term". There's already
>> PerFieldAnalyzerWrapper for each field, but I don't think that's what
>> you want.
>>
>> How do you expect a per term analyzer to behave? I'm having a hard
>> time thinking of a use case that's general. You could always
>> roll your own analyzer that didn't change case for your particular
>> list of words.
>>
>> But the problem is your users. In your example, suppose a user
>> typed in "dell computers". Would that match "Dell computers"?
>> Does your analyzer automatically upper-case some words? If it
>> does, that's the same as lower casing them all. If it doesn't,
>> how do you explain that to your users?
>>
>> All in all, I'm having a tough time imagining how this would work.
>> It's easy enough to say "let's assume", but I suspect that
>> whatever solution satisfied your example will have its own problems
>> that are far worse than just lower-casing things.
>>
>> Best
>> Erick
>>
>>
>> On Wed, Jun 25, 2008 at 5:37 AM, John Byrne <john.byrne@propylon.com>
>> wrote:
>>
>>
>>
>>> Hi,
>>>
>>> I know that case-insensitive searching is normally done by creating an
>>> all-lower-case version of the documents, and turning the search terms
>>> into
>>> lower case whenever this field is searched, but this approach has it's
>>> disadvantages.
>>>
>>> Let's say, for example, you want to find "Dell" (with a capital "D"),
>>> near
>>> "computers" (with or without capitals, ie. in any case). The problem is
>>> that
>>> you would need to use a SpanQuery to find terms near each other; but if
>>> the
>>> case-sensitivity required is different for each term, then they will be
>>> in
>>> different fields, making the use of SpanQuerys inpossible.
>>>
>>> There might be ways to work around this, but my question is: will
>>> case-insensitvity ever be added to Lucene as per-Term option? If not, can
>>> anyone tell me where I should start looking in order to make this change
>>> myself?
>>>
>>> Thanks!
>>>
>>> -JB
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>
>>  ------------------------------------------------------------------------
>>
>> No virus found in this incoming message.
>> Checked by AVG. Version: 7.5.524 / Virus Database: 270.4.1/1517 - Release
>> Date: 24/06/2008 20:41
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message