lucene-java-user mailing list archives

From John Byrne
Subject Re: case insensitivity
Date Thu, 26 Jun 2008 08:16:29 GMT
Chris Hostetter wrote:
> the enumeration is in lexicographical order, so "Dell" is nowhere near
> "dell" in the enumeration.  even if we added a boolean property to Terms
> indicating that it's a case insensitive Term the "seeking" along that
> enumeration would be ... less optimal ... than it can be now.
Ah, now I understand!
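
To make the ordering point concrete, here is a small sketch. It is only a toy: a sorted Python list stands in for the term enumeration, and the sample terms are made up. For ASCII terms, sorting by code point puts every capitalised term before every lower-case one, so the two casings of a word end up far apart.

```python
# Toy model: a sorted list plays the role of the term enumeration.
# The terms are invented for illustration.
terms = sorted(["apple", "Dell", "dell", "Zebra", "computers", "Computers"])

# Upper-case ASCII letters sort before lower-case ones, so "Dell" and
# "dell" are not adjacent in the enumeration.
print(terms)

# Number of entries to step over to get from "Dell" to "dell".
gap = terms.index("dell") - terms.index("Dell")
print(gap)
```

With a real vocabulary the gap would be much larger, which is why a case-insensitive lookup cannot simply seek to one spot in the enumeration.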
> : > > Let's say, for example, you want to find "Dell" (with a capital "D"), near
> : > > "computers" (with or without capitals, ie. in any case). The problem is
> : > > that
> : > > you would need to use a SpanQuery to find terms near each other; but if
> : > > the
> : > > case-sensitivity required is different for each term, then they will be in
> : > > different fields, making the use of SpanQuerys impossible.
> i assume by this statement that you are suggesting that you want your
> users to be able to say "find me $foo near $bar where $foo must be in the
> case i specified but bar can be in any case" is that correct?
Yes, that's exactly what I meant.
> in that case Erick's point about indexing both the original case and
> some normalized casing at the same term position is the best way to go --
> the only downside this has compared to separate fields is that it can
> introduce some bias in your tf/idf values ... but that can be eliminated
> by prefixing all of your "normalized" terms with some unicode character
> that your tokenizer would normally strip off.
From Erick's reply:

"I suppose something like that might work, but I still think that presenting
a user with matches that sometimes work case sensitive and sometimes
doesn't would"

The user would, of course, choose which terms are case-sensitive when 
they query, using a modifier in the query language. (I would have to 
implement that). It's something my users have asked to be able to do -  
in their view, fields are something that should be used for different 
content, and case-sensitivity should be an option on *any* field. But 
what you have suggested should allow it to work that way, by adding both 
versions of the term at the same position.
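
As I understand the suggestion, it would look roughly like the sketch below. This is a toy positional index in plain Python, not Lucene code: the names `FOLDED`, `build_index`, `key` and `near` are all hypothetical, the marker character is an arbitrary choice, and the proximity test is a simplification of what a SpanQuery does. In Lucene itself the second term would normally be emitted by a token filter with a position increment of 0.

```python
from collections import defaultdict

# Hypothetical marker for the case-folded variant of a term. Any character
# the tokenizer would never leave inside a real term would do.
FOLDED = "\u0001"

def build_index(docs):
    """Map term -> {doc_id: [positions]}, storing two terms per token."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.split()):
            index[token][doc_id].append(pos)                   # original case
            index[FOLDED + token.lower()][doc_id].append(pos)  # folded, same position
    return index

def key(term, case_sensitive):
    """Pick which of the two indexed variants a query term should hit."""
    return term if case_sensitive else FOLDED + term.lower()

def near(index, a, b, slop):
    """Return doc ids where a and b occur within `slop` positions.

    a and b are (term, case_sensitive) pairs.
    """
    post_a = index.get(key(*a), {})
    post_b = index.get(key(*b), {})
    hits = []
    for doc_id in sorted(post_a.keys() & post_b.keys()):
        if any(abs(pa - pb) <= slop
               for pa in post_a[doc_id] for pb in post_b[doc_id]):
            hits.append(doc_id)
    return hits

docs = {
    1: "Dell makes COMPUTERS",
    2: "the dell had no computers",
    3: "Dell sells laptops and many other kinds of computers",
}
index = build_index(docs)

# "Dell" must match exactly; "computers" may be in any case.
result = near(index, ("Dell", True), ("computers", False), slop=2)

# Both terms in any case.
result_any = near(index, ("dell", False), ("computers", False), slop=3)
```

Because both variants live in one field at the same position, the proximity check works whichever variant each query term selects, and the marker keeps the folded terms from being counted together with the original-case ones.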

Thanks guys!

