lucene-solr-user mailing list archives

From Jonathan Rochkind <rochk...@jhu.edu>
Subject WordDelimiter filter, expanding to multiple words, unexpected results
Date Tue, 02 Sep 2014 16:41:56 GMT
Hello, I'm running into a case where a query is not returning the
results I expect, and I'm hoping someone can offer an explanation that
might help me fine-tune things or understand what's going on.

I am running Solr 4.3.

My filter chain includes a WordDelimiterFilter and, later, a filter that
lowercases everything for case-insensitive searching. It includes many
other things too, but I think these are the pertinent facts.

For query "dELALAIN", the WordDelimiterFilter splits into:

text: d
start: 0
position: 1

text: ELALAIN
start: 1
position: 2

text: dELALAIN
start: 0
position: 2

Note the duplication/overlap of the tokens -- one version with "d" and 
"ELALAIN" split into two tokens, and another with just one token.
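To make the overlap easier to see, here is a small Python sketch that just models the three tokens listed above as plain tuples. It does not call Solr or Lucene; the helper names (`lowercase`, `group_by_position`, `position_increments`) are made up for illustration, and `lowercase` is only a stand-in for the later filter in my chain.

```python
from collections import defaultdict

# Tokens exactly as listed above: (text, start offset, position).
tokens = [
    ("d", 0, 1),
    ("ELALAIN", 1, 2),
    ("dELALAIN", 0, 2),
]


def lowercase(tokens):
    """Stand-in for the later lowercasing step (the real chain uses an ICU filter)."""
    return [(text.lower(), start, pos) for text, start, pos in tokens]


def group_by_position(tokens):
    """Map each position to the token texts that occupy it, in stream order."""
    slots = defaultdict(list)
    for text, _start, pos in tokens:
        slots[pos].append(text)
    return dict(sorted(slots.items()))


def position_increments(tokens):
    """Difference between each token's position and the previous token's position."""
    increments, previous = [], 0
    for _text, _start, pos in tokens:
        increments.append(pos - previous)
        previous = pos
    return increments


slots = group_by_position(lowercase(tokens))
print(slots)
print(position_increments(tokens))
```

In this model, "elalain" and "delalain" share position 2 while "d" sits alone at position 1, and the last token has a position increment of 0 (it is stacked on the previous token rather than following it).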

Later, all the tokens are lowercased by another filter in the chain.
(It is actually an ICU filter that does something more complicated than
just lowercasing, but I think we can treat it as lowercasing for the
purposes of this discussion.)

If I understand correctly what the WordDelimiterFilter is trying to do
here, it's probably treating the lowercase "d" followed by an uppercase
letter as a special case. (I don't get this behavior with other
mixed-case queries that don't begin with 'd'.)

And what I think it's trying to do is match text indexed as "d
elalain" as well as text indexed as "delalain".

The problem is, it's not accomplishing that -- it is NOT matching text 
that was indexed as "delalain" (one token).

I don't entirely understand what the "position" attribute is for -- but 
I wonder if in this case, the position on "dELALAIN" is really supposed 
to be 1, not 2?  Could that be responsible for the bug?  Or is position 
irrelevant in this case?
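To illustrate why I suspect position might matter, here is a toy Python model. It ASSUMES the overlapping tokens end up in a position-sensitive (phrase-like) query, where each position is a set of alternatives; I have not confirmed that this is what Solr actually builds here. `phrase_like_match` is a hypothetical helper, not a Solr or Lucene API.

```python
def phrase_like_match(query_slots, doc_tokens):
    """Return True if some run of consecutive doc tokens fills every query slot in order.

    query_slots: list of sets, one set of alternative terms per query position.
    doc_tokens:  list of indexed terms, one per document position.
    """
    width = len(query_slots)
    for offset in range(len(doc_tokens) - width + 1):
        window = doc_tokens[offset:offset + width]
        if all(term in slot for term, slot in zip(window, query_slots)):
            return True
    return False


# Assumption: the lowercased tokens from the query, grouped by their positions.
query_slots = [{"d"}, {"elalain", "delalain"}]

split_doc = ["d", "elalain"]   # text indexed as two tokens
single_doc = ["delalain"]      # text indexed as one token

print(phrase_like_match(query_slots, split_doc))
print(phrase_like_match(query_slots, single_doc))
```

Under that assumption, the two-token document matches, but the single-token "delalain" document cannot, because nothing in it can fill the first position, which only accepts "d". That is the behavior I am seeing, though I don't know whether this is the real cause.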

If that's not it, then I'm at a loss as to what may be causing this
-- or even whether it's a bug at all, or whether I'm just not
understanding the intended behavior. I expect a query for "dELALAIN" to
match text indexed as "delalain" (because of the forced lowercasing in
the filter chain). But it's not doing so. Are my expectations wrong? Is
it a bug? Something else?

Thanks for any advice,

Jonathan
