lucene-solr-user mailing list archives

From Cat Bieber <cbie...@techtarget.com>
Subject String ordering appears different with sort vs range query
Date Thu, 19 Apr 2012 19:39:40 GMT
I'm trying to use a Solr query to find the next title in alphabetical 
order after a given string. The issue I'm facing is that the sort param 
seems to sort non-alphanumeric characters in a different order from the 
ordering used by a range filter in the q or fq param. I can't filter the 
non-alphanumeric characters out because they're integral to the data and 
it would not be a useful ordering if it were based only on the 
alphanumeric portion of the strings.

I'm running Solr version 3.5.

In my current approach, I have a field that is a unique string for each 
document:

<fieldType name="lowerCaseSort" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

<field name="uniqueSortString" type="lowerCaseSort" indexed="true"
       stored="true"/>

I'm passing the value for the current document in a range to query 
everything after the current string, sorted ascending:

/select?fl=uniqueSortString&sort=uniqueSortString+asc&q=uniqueSortString:["$1+ZX+Spectrum+HOBETA+format+file"+TO+*]&wt=xml&rows=5&version=2.2
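
For anyone reproducing this request from code, here is a minimal Java sketch of how the URL above could be assembled. The class and method names (NextTitleQuery, rangeQuery, selectPath) are hypothetical and not part of Solr; the only library call used is java.net.URLEncoder.encode(String, String). It assumes the field name contains no characters that need encoding.

```java
import java.net.URLEncoder;

public class NextTitleQuery {
    // Hypothetical helper: range query matching everything at or after `current`.
    static String rangeQuery(String field, String current) {
        // Escape backslashes and double quotes so the value stays one quoted term.
        String escaped = current.replace("\\", "\\\\").replace("\"", "\\\"");
        return field + ":[\"" + escaped + "\" TO *]";
    }

    // Hypothetical helper: builds the /select path with the same parameters as the post.
    static String selectPath(String field, String current, int rows) throws Exception {
        // URLEncoder turns spaces into '+' and percent-encodes '$', '[', ']', '"' and ':'.
        String q = URLEncoder.encode(rangeQuery(field, current), "UTF-8");
        String sort = URLEncoder.encode(field + " asc", "UTF-8");
        // Assumption: `field` is a plain identifier, so it is used unencoded in fl.
        return "/select?fl=" + field + "&sort=" + sort + "&q=" + q
                + "&wt=xml&rows=" + rows + "&version=2.2";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            selectPath("uniqueSortString", "$1 ZX Spectrum HOBETA format file", 5));
    }
}
```

Unlike the hand-written URL, this also percent-encodes the "$", brackets, and quotes, which should not change how Solr parses the query.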

In theory, I expect the first result to be the current item and the 
second result to be the next one. However, I'm finding that the sort and 
the range filter seem to use different ordering:

<result name="response" numFound="448" start="0">
  <doc>
    <str name="uniqueSortString">$1 ZX Spectrum - Emulator</str>
  </doc>
  <doc>
    <str name="uniqueSortString">$1 ZX Spectrum HOBETA format file</str>
  </doc>
  <doc>
    <str name="uniqueSortString">$1 ZX Spectrum Hobetta Picture Format</str>
  </doc>
  <doc>
    <str name="uniqueSortString">$? TR-DOS ZX Spectrum file in HOBETA format</str>
  </doc>
  <doc>
    <str name="uniqueSortString">$A AutoCAD Autosave File ( Autodesk Inc.)</str>
  </doc>
</result>

Based on the ordering of the results, the sort places "-" before "H", 
but the range filter should have excluded that first result if it 
ordered strings the same way. Digging through the code, it looks like 
sorting uses String.compareTo() for ordering on a text/string field. 
However, I haven't been able to track down where the range filter code 
is. If someone can point me in the right direction to find that code, 
I'd love to look through it. Or, if anyone has suggestions regarding a 
different approach or changes I can make to this query/field, that 
would be very helpful.
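
To make the comparison concrete, here is a minimal Java sketch that applies String.compareTo() to two of the values from the response above. It is not Solr's code: the class name OrderingCheck and the normalize helper are hypothetical, and normalize only approximates the field's analyzer (trim plus lowercase). The second comparison leaves the range endpoint in its original case, purely as an assumption worth ruling out; nothing in this message establishes whether Solr 3.5 runs range endpoints through the analyzer.

```java
import java.util.Locale;

public class OrderingCheck {
    // Hypothetical helper: rough stand-in for the analyzer (trim + lowercase only).
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String emulator = "$1 ZX Spectrum - Emulator";
        String hobeta = "$1 ZX Spectrum HOBETA format file";

        // Both values normalized: '-' (0x2D) sorts before 'h' (0x68),
        // so the result is negative and "- emulator" comes first.
        int bothNormalized = normalize(emulator).compareTo(normalize(hobeta));
        System.out.println("normalized vs normalized:   " + bothNormalized);

        // Indexed value normalized, endpoint left as typed (assumption):
        // 'z' (0x7A) sorts after 'Z' (0x5A), so the result is positive and
        // the indexed value would fall inside a range starting at the endpoint.
        int rawEndpoint = normalize(emulator).compareTo(hobeta);
        System.out.println("normalized vs raw endpoint: " + rawEndpoint);
    }
}
```

If the second case reflects what the range filter does, the first document would match the range even though the sort places it ahead of the current item.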

Thanks for your time.
-Cat Bieber
