lucene-java-user mailing list archives

From "J.J. Larrea" <...@panix.com>
Subject Re: Sort fields shouldn't be tokenized
Date Mon, 16 Nov 2009 17:08:03 GMT
You can certainly use an analyzer chain to process the incoming text
for a sort field, as long as a single Term emerges or only the first
Term is significant for sorting. I don't believe that having the
tokenized flag set on the field makes any difference to the sort
logic.
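
As a rough illustration of the single-Term case, here is a minimal
plain-Java sketch. It deliberately does not use Lucene's analyzer API;
the class name SortKeys, the method singleTermKey, and the stopword
list are all hypothetical and exist only for this example. It
case-folds the value, strips punctuation, drops stopwords, and then
rejoins what is left into one string, so exactly one term would be
indexed for the sort field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not a Lucene class): one untokenized sort key per value. */
final class SortKeys {
    // Illustrative stopword list only; a real analyzer's list would differ.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "its", "the", "their"));

    private SortKeys() {}

    /** Case-folds, strips punctuation, drops stopwords, rejoins into ONE term. */
    static String singleTermKey(String value) {
        // Keep only letters, digits, and whitespace after lowercasing.
        String folded = value.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{N}\\s]", "");
        List<String> kept = new ArrayList<>();
        for (String word : folded.trim().split("\\s+")) {
            if (!word.isEmpty() && !STOPWORDS.contains(word)) {
                kept.add(word);
            }
        }
        // Joining with spaces keeps the whole value as a single sort term.
        return String.join(" ", kept);
    }
}
```

Sorting on keys built this way compares whole values rather than just
a leading token, which is the property that matters for the sort.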

- J.J.

On Nov 16, 2009, at 11:38 AM, Jeff Plater wrote:

> Thanks - so if my sort field is a single term then I should be ok with
> using an analyzer (to lowercase it for example).
>
> -Jeff
>
> -----Original Message-----
> From: J.J. Larrea [mailto:jjl@panix.com]
> Sent: Monday, November 16, 2009 11:19 AM
> To: java-user@lucene.apache.org
> Subject: Re: Sort fields shouldn't be tokenized
>
> It's not universally true that a tokenized field cannot be used as a
> sort field, but it is true that you will not get the desired sort
> order except in special cases:
>
> Lucene's indexes of course contain inverted tables which map Term ->
> DocumentID, DocumentID, ...
> But for sorting, once a set of Document IDs has been selected, the
> respective Term values are used as an ordering key.
> In order to do that, the first time a field is referenced for sorting
> a FieldCache table is allocated and pre-filled with Document -> Term
> mappings.
> For indexed text which is tokenized into multiple Terms, only the
> first one is retained.  This is done for efficiency reasons (lookup
> speed and memory utilization).
>
> So say that for a title field you had indexed strings such as:
>
> The Turkey and its Predators
> Turkey Cooking made Easy
> Turkeys and their Discontent
>
> Assuming the typical analysis steps of case folding, stopword removal,
> depunctuation, depluralization, etc. the indexed Terms would be
> something on the order of:
>
> turkey / predator
> turkey / cooking / made / easy
> turkey / their / discontent
>
> but sorting would only use the initial token 'turkey' for the title
> field, and all such documents starting with turkey would be ordered
> arbitrarily (by Document ID) in the hitlist, subject of course to any
> subsequent sorting stages.  That is likely NOT what you would want
> for title sorting.
>
> Rather, you would certainly want to retain case folding, and probably
> retain stopword removal and depunctuation and maybe depluralization
> (perhaps with the rules somewhat altered from the field variant used
> for searching), but turn off any tokenization, and any operations like
> synonym substitution/enhancement that could alter the sort order in
> user-unexpected ways.
>
> Does the proviso make more sense now?
>
> - J.J. Larrea
>
> On Nov 16, 2009, at 10:36 AM, Jeff Plater wrote:
>
>>
>> I am looking at adding some sorting functionality to my application
>> and read that Sort fields should not be tokenized - can anyone
>> explain why?  I have code that is tokenizing the sort fields and it
>> seems to be working.  Is it just because some tokenizing can change
>> the value (like removing stop words and such), which can produce an
>> invalid sort order?
>> Thanks.
>>
>> -Jeff

