lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: case insensitivity
Date Wed, 25 Jun 2008 21:39:16 GMT

: I imangined (and maybe I am over simplifying it!) that somewhere in the API
: there must be a string comparison using 'String.equals()' that determines if a
: document contains the term or not - and that use of 'equals()' has permanently
: locked Lucene into case-sensitive searching. The values being compared could
: be first lower-cased (or equalsIgnoreCase could be used) depending on the
: value of a boolean flag in the Term object.

You are over simplifying it a bit ... string comparisons are done in the 
internals, but not to compare a query "terms" to a document "terms" ... 
the index is inverted so there is a single enumeration of all indexed 
terms (regardless of which documents they are in) which maintain pointers 
to the docs that contained.  querying involves seeking along that 
enumeration to find the indexed term that corrisponds to the query term.

the enumeration is in lexigraphical order, so "Dell" is no where near 
"dell" in the enumeration.  even if we added a boolean property to Terms 
indicating that it's case insensitive Term the "seeking" along that 
enumeration would be ... lss optimal ... then it can be now.

: > > Let's say, for example, you want to find "Dell" (with a capital "D"), near
: > > "computers" (with or without capitals, ie. in any case). The problem is
: > > that
: > > you would need to use a SpanQuery to find terms near each other; but if
: > > the
: > > case-sensitivity required is different for each term, then they will be in
: > > different fields, making the use of SpanQuerys inpossible.

i assume by this statement that you are suggesting that you want your
users to be able to say "find me $foo near $bar where $foo must be in the
case i specified but bar can be in any case" is that correct?

in that case Erick's point about indexing both the orriginal case and 
some normalized casing at the same term position is the best way to go -- 
the only downside this has compared to seperate fields is that it can 
introduce some bias in your tf/idf values ... but that can be eliminated 
by prefaxing all of your "normalized" terms with some unicode character 
that your tokenizer would normally strip off.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message