lucene-java-user mailing list archives

From "Jeff Plater" <jpla...@healthmarketscience.com>
Subject RE: Sort fields shouldn't be tokenized
Date Mon, 16 Nov 2009 16:38:15 GMT
Thanks - so if my sort field is a single term then I should be ok with
using an analyzer (to lowercase it for example).

-Jeff

-----Original Message-----
From: J.J. Larrea [mailto:jjl@panix.com] 
Sent: Monday, November 16, 2009 11:19 AM
To: java-user@lucene.apache.org
Subject: Re: Sort fields shouldn't be tokenized

It's not universally true that a tokenized field cannot be used as a  
sort field, but it is true that you will not get the desired sort  
order except in special cases:

Lucene's indexes of course contain inverted tables which map Term ->  
DocumentID, DocumentID, ...
But for sorting, once a set of Document IDs has been selected, the
respective Term values are used as an ordering key.
In order to do that, the first time a field is referenced for sorting,
a FieldCache table is allocated and pre-filled with Document -> Term
mappings.
For indexed text which is tokenized into multiple Terms, only the
first one is retained.  This is done for efficiency reasons (lookup
speed and memory utilization).
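
To make the two tables concrete, here is a toy model in plain Java.
It only mimics the description above: the class and method names are
made up for illustration and this is not Lucene's actual FieldCache
code, whose details may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: hypothetical names, not Lucene classes.
class FieldCacheSketch {

    // Inverted table: Term -> list of Document IDs containing that term.
    static Map<String, List<Integer>> invert(String[][] docTokens) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docTokens.length; docId++) {
            for (String term : docTokens[docId]) {
                List<Integer> postings =
                        index.computeIfAbsent(term, k -> new ArrayList<>());
                if (!postings.contains(docId)) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    // Sort table: Document -> one Term. There is a single slot per
    // document, so a multi-token field can contribute only one term
    // (here the first, following the description above).
    static String[] sortKeys(String[][] docTokens) {
        String[] keys = new String[docTokens.length];
        for (int docId = 0; docId < docTokens.length; docId++) {
            keys[docId] = docTokens[docId].length > 0 ? docTokens[docId][0] : null;
        }
        return keys;
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"turkey", "predator"},
            {"turkey", "cooking", "made", "easy"},
        };
        System.out.println(invert(docs));
        System.out.println(Arrays.toString(sortKeys(docs)));
    }
}
```

The point of the sketch is the shape of the second table: one slot per
document, which is cheap to look up but cannot hold a whole token list.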

So say that for a title field you had indexed strings such as:

The Turkey and its Predators
Turkey Cooking made Easy
Turkeys and their Discontent

Assuming the typical analysis steps of case folding, stopword removal,  
depunctuation, depluralization, etc. the indexed Terms would be  
something on the order of:

turkey / predator
turkey / cooking / made / easy
turkey / their / discontent

but sorting would only use the initial token 'turkey' for the title
field, and all such documents starting with turkey would be ordered
arbitrarily (by Document ID) in the hitlist - subject of course to any
subsequent sorting stages.  That is likely NOT what you would want
for title sorting.
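
A small plain-Java sketch of that effect, assuming each document
contributes only one term as its sort key.  The fourth key ("aardvark")
is a made-up extra document added only so the output shows some real
ordering; the class and method names are hypothetical and nothing here
is Lucene API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: sorts document IDs by a single retained term.
class FirstTokenSortDemo {

    // Returns document IDs ordered by their one-term sort key.
    // List.sort is stable, so documents with equal keys stay in
    // document-ID order.
    static List<Integer> sortByKey(String[] keys) {
        List<Integer> docIds = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            docIds.add(i);
        }
        docIds.sort(Comparator.comparing((Integer id) -> keys[id]));
        return docIds;
    }

    public static void main(String[] args) {
        // Docs 0-2 are the three turkey titles; doc 3 is hypothetical.
        String[] keys = {"turkey", "turkey", "turkey", "aardvark"};
        System.out.println(sortByKey(keys)); // prints [3, 0, 1, 2]
    }
}
```

The three turkey titles tie on their key, so their relative order is
just the order in which they were indexed.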

Rather, you would certainly want to retain case folding, and probably
retain stopword removal and depunctuation and maybe depluralization
(perhaps with the rules somewhat altered from the field variant used
for searching), but turn off any tokenization, and any operations like
synonym substitution/enhancement that could alter the sort order in
ways the user would not expect.
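
As a rough sketch of such a sort key, the plain-Java helper below folds
case, strips punctuation and drops stopwords but keeps the result as
one string.  The stopword list and all names are assumptions made for
this example, depluralization is left out, and the ASCII-only character
filter would need rethinking for non-English titles.  The resulting
string would then be indexed as a single untokenized term in a separate
sort field.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: builds a one-term sort key from a title.
class SortKeySketch {

    // Hypothetical stopword list, chosen only for this example.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "its"));

    static String sortKey(String title) {
        // Case folding, then depunctuation (ASCII letters and digits only).
        String folded = title.toLowerCase(Locale.ROOT)
                             .replaceAll("[^a-z0-9 ]", " ");
        StringBuilder key = new StringBuilder();
        for (String word : folded.trim().split("\\s+")) {
            if (word.isEmpty() || STOPWORDS.contains(word)) {
                continue; // stopword removal
            }
            if (key.length() > 0) {
                key.append(' ');
            }
            key.append(word);
        }
        // The words are rejoined: the key stays a single string.
        return key.toString();
    }

    public static void main(String[] args) {
        Set<String> sorted = new TreeSet<>();
        sorted.add(sortKey("The Turkey and its Predators"));
        sorted.add(sortKey("Turkey Cooking made Easy"));
        sorted.add(sortKey("Turkeys and their Discontent"));
        for (String key : sorted) {
            System.out.println(key);
        }
    }
}
```

With keys like these the three titles no longer tie on 'turkey'; they
sort on the full normalized string.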

Does the proviso make more sense now?

- J.J. Larrea

On Nov 16, 2009, at 10:36 AM, Jeff Plater wrote:

>
> I am looking at adding some sorting functionality to my application
> and read that Sort fields should not be tokenized - can anyone
> explain why?
> I have code that is tokenizing the sort fields and it seems to be
> working.  Is it just because some tokenizing can change the value
> (like remove stop words and such) which can produce an invalid sort
> order?
> Thanks.
>
> -Jeff
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>