lucene-dev mailing list archives

From "Paul Cowan (JIRA)" <>
Subject [jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term
Date Wed, 04 Mar 2009 09:13:56 GMT


Paul Cowan commented on LUCENE-1372:

Yes, sorry, I might have been unclear. When I referred to 'first term' I meant 'the first
term lexicographically' -- at least as far as binary order is 'lexicographically' -- i.e.
the 'lowest' term.

I like the idea of the pluggable behaviour, even if it's a simple boolean:

boolean sortByLowestTerm = ...

if (retArray[termDocs.doc()] == null || !sortByLowestTerm) {
   retArray[termDocs.doc()] = termval;
}
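
To make the effect of the flag concrete, here is a minimal, self-contained sketch that simulates the fill loop on the "fruit" example from the issue description quoted below. It does not use Lucene at all: the class name, the fill() and fruitPostings() helpers, and the sorted map standing in for the TermEnum/TermDocs iteration are all assumptions made for illustration, and docs 1-4 are mapped to array slots 0-3.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class LowestTermSortSketch {
    // Hypothetical stand-in for the index: term -> documents containing it.
    // A TreeMap iterates terms in sorted order, much as a TermEnum would.
    static SortedMap<String, int[]> fruitPostings() {
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("apple", new int[] {0, 2, 3});
        postings.put("banana", new int[] {1, 2});
        postings.put("zebra", new int[] {3});
        return postings;
    }

    // Walks the terms in order and records one term per document.
    // With sortByLowestTerm == false, later terms overwrite earlier ones.
    static String[] fill(SortedMap<String, int[]> postings, int maxDoc,
                         boolean sortByLowestTerm) {
        String[] retArray = new String[maxDoc];
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                if (retArray[doc] == null || !sortByLowestTerm) {
                    retArray[doc] = entry.getKey();
                }
            }
        }
        return retArray;
    }

    public static void main(String[] args) {
        // Overwriting behaviour: prints [apple, banana, banana, zebra]
        System.out.println(Arrays.toString(fill(fruitPostings(), 4, false)));
        // Lowest-term behaviour: prints [apple, banana, apple, apple]
        System.out.println(Arrays.toString(fill(fruitPostings(), 4, true)));
    }
}
```

With the flag off, the sketch reproduces the "last term wins" ordering described in the issue; with it on, each document keeps the first (lowest) term it was seen with.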

We could replace this with a pluggable 'TermSelectionPolicy' or somesuch (as suggested by
Earwin on java-dev@).... perhaps something like

interface SortTermCollector {
  void addTermText(String text);
  Comparable toSortValue();
}

and then use a SortTermCollector[maxDoc] in the field cache, then iterate over the array at
the end to convert the SortTermCollectors into Comparables (or make them directly comparable).
Implementation of addTermText would be trivial for the first and last behaviour ("if (sortValue
== null) sortValue = text" and "sortValue = text" respectively) but we could use it for our
'full alphabetical ordering', it could perform functions on the terms as Earwin mentions,
etc. This may or may not be overkill.
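
As a rough sketch of how the collector idea might look, assuming nothing about the eventual patch: the interface follows the two methods named above, while FirstTermCollector, LastTermCollector, the run() driver and its hard-coded term data are hypothetical names invented here. The driver stands in for the field cache loop, again with docs 1-4 of the "fruit" example in slots 0-3.

```java
import java.util.Arrays;
import java.util.function.Supplier;

interface SortTermCollector {
    void addTermText(String text);
    Comparable<?> toSortValue();
}

// Keeps the first term offered, i.e. the lowest when terms arrive in sorted order.
class FirstTermCollector implements SortTermCollector {
    private String sortValue;
    public void addTermText(String text) {
        if (sortValue == null) sortValue = text;
    }
    public String toSortValue() { return sortValue; }
}

// Keeps the last term offered, matching the existing overwriting behaviour.
class LastTermCollector implements SortTermCollector {
    private String sortValue;
    public void addTermText(String text) { sortValue = text; }
    public String toSortValue() { return sortValue; }
}

class SortTermCollectorSketch {
    // Hypothetical driver: one collector per document, fed terms in sorted
    // order, then converted to sort values in a final pass over the array.
    static Comparable<?>[] run(Supplier<SortTermCollector> factory) {
        String[] terms = {"apple", "banana", "zebra"};
        int[][] docsForTerm = {{0, 2, 3}, {1, 2}, {3}};
        SortTermCollector[] collectors = new SortTermCollector[4];
        for (int t = 0; t < terms.length; t++) {
            for (int doc : docsForTerm[t]) {
                if (collectors[doc] == null) collectors[doc] = factory.get();
                collectors[doc].addTermText(terms[t]);
            }
        }
        Comparable<?>[] values = new Comparable<?>[collectors.length];
        for (int doc = 0; doc < collectors.length; doc++) {
            values[doc] = collectors[doc] == null ? null : collectors[doc].toSortValue();
        }
        return values;
    }

    public static void main(String[] args) {
        // prints [apple, banana, apple, apple]
        System.out.println(Arrays.toString(run(FirstTermCollector::new)));
        // prints [apple, banana, banana, zebra]
        System.out.println(Arrays.toString(run(LastTermCollector::new)));
    }
}
```

Allocating one collector object per document is the obvious cost of this design compared with the boolean flag, which is part of why it may be overkill.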

I'm happy to try and get the changes you'd like for TrieRange, because they're an almost-but-not-quite-acceptable
compromise for us (we're using a patched version of Lucene that does this now), but I'm content
to use our own class internally, happy if we can expose the DEFAULT_PARSER implementations
(and anything else -- my class sits in the same package so rebasing it may expose other things
that need to be made protected etc) -- and anything beyond that (landing it in contrib or
core) would be brilliant.

My two proposals certainly aren't mutually exclusive; they don't really touch each other.

> Proposal: introduce more sensible sorting when a doc has multiple values for a term
> -----------------------------------------------------------------------------------
>                 Key: LUCENE-1372
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.2
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: LUCENE-1372-MultiValueSorters.patch, lucene-multisort.patch
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting on a field
for which multiple values exist for one document. For example, imagine a field "fruit" which
is added to a document multiple times, with the values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> if one sorts on the field "fruit", the loop in FieldCacheImpl.stringsIndexCache.createValue()
(and similarly for the other methods in the various FieldCacheImpl caches) does the following:
>           while (termDocs.next()) {
>             retArray[termDocs.doc()] = t;
>           }
> which means that we look over the terms in their natural order and, on each one, overwrite
retArray[doc] with the value for each document with that term. Effectively, this overwriting
means that a string sort in this circumstance will sort by the LAST term lexicographically,
so the docs above will effectively be sorted as if they had the single values ("apple", "banana",
"banana", "zebra") which is nonintuitive. To change this to sort on the first term in the
TermEnum seems relatively trivial and low-overhead; while it's not perfect (it's not locale-aware,
for example) the behaviour seems much more sensible to me. Interested to see what people think.
> Patch to follow.
