lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1372) Proposal: introduce more sensible sorting when a doc has multiple values for a term
Date Wed, 04 Mar 2009 07:41:56 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678623#action_12678623 ]

Uwe Schindler commented on LUCENE-1372:
---------------------------------------

For TrieRange the proposed variant to sort by the lowest term in TermEnum is absolutely fine.

Sorting by the first term in the document is simply not possible. It might work if the term
positions were used during array creation, but that would slow down array creation, and it
only works with real tokenized fields, not with fields like TrieRange.

TrieRange does not use String/StringIndex sorting; the ordering is done using the raw long/int
values. The arrays are filled and the SortFields are instantiated using a custom FieldCache.Parser
(see LUCENE-1478). So if the order is taken from the lowest term (which is always the
highest-precision one in TrieRange), the order would be correct.

In the current version, the results would be sorted using the last term in the TermEnum, which
is the lowest-precision one. The order is then simply too imprecise (because the documents
indexed with TrieRange have the lower int/long bits stripped away).
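
To illustrate why ordering by a lower-precision value is too coarse, here is a minimal sketch.
It assumes, as a simplification, that a lower-precision value is just the original long with
its lowest bits cleared; the real TrieRange term encoding is not modelled. The names
TriePrecisionSketch, stripLowBits and sortByReducedPrecision are made up for this example and
are not part of Lucene.

```java
import java.util.Arrays;
import java.util.Comparator;

final class TriePrecisionSketch {

    /** Clears the lowest {@code shift} bits (simplified stand-in for a lower-precision value). */
    static long stripLowBits(long value, int shift) {
        return (value >>> shift) << shift;
    }

    /**
     * Returns a copy of the values ordered only by their reduced-precision form.
     * Arrays.sort on objects is stable, so values that collapse to the same
     * reduced form keep their input order.
     */
    static Long[] sortByReducedPrecision(long[] values, int shift) {
        Long[] boxed = new Long[values.length];
        for (int i = 0; i < values.length; i++) {
            boxed[i] = values[i];
        }
        Arrays.sort(boxed, Comparator.comparingLong((Long v) -> stripLowBits(v, shift)));
        return boxed;
    }

    public static void main(String[] args) {
        long[] values = {1005, 1001, 1003, 900};
        // Full precision (shift 0): the correct numeric order.
        System.out.println(Arrays.toString(sortByReducedPrecision(values, 0)));
        // prints [900, 1001, 1003, 1005]
        // Lowest 4 bits stripped: 1005, 1001 and 1003 all collapse to 992.
        System.out.println(Arrays.toString(sortByReducedPrecision(values, 4)));
        // prints [900, 1005, 1001, 1003]
    }
}
```

With 4 bits stripped, the three values near 1000 become indistinguishable, so their relative
order is whatever the input order happened to be.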

The "simple" proposal is enough for TrieRange. Maybe we can add an option to switch between
first/last term (and make this option also available to SortFields and other parts where the
FieldCache is used).
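
A rough sketch of what such a switch could look like, using the "fruit" example from the
quoted issue description. This is a standalone simulation, not FieldCacheImpl code: a sorted
map stands in for the TermEnum/TermDocs iteration, and the Pick enum, the fill method and the
class name are hypothetical names invented for this example.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

final class FieldCacheFillSketch {

    /** Hypothetical option; nothing like this exists in the FieldCache API discussed here. */
    enum Pick { FIRST_TERM, LAST_TERM }

    /**
     * Visits terms in sorted order and, for each term, the documents containing it.
     * LAST_TERM always overwrites the slot (the behaviour described in the issue);
     * FIRST_TERM only writes a slot that is still empty.
     */
    static String[] fill(TreeMap<String, int[]> postings, int maxDoc, Pick pick) {
        String[] retArray = new String[maxDoc];
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int doc : entry.getValue()) {
                if (pick == Pick.LAST_TERM || retArray[doc] == null) {
                    retArray[doc] = entry.getKey();
                }
            }
        }
        return retArray;
    }

    /** term -> documents containing it, for docs 1..4 of the "fruit" example. */
    static TreeMap<String, int[]> fruitPostings() {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("apple", new int[] {1, 3, 4});
        postings.put("banana", new int[] {2, 3});
        postings.put("zebra", new int[] {4});
        return postings;
    }

    public static void main(String[] args) {
        // Slot 0 is unused because the example numbers its documents from 1.
        System.out.println(Arrays.toString(fill(fruitPostings(), 5, Pick.LAST_TERM)));
        // prints [null, apple, banana, banana, zebra]
        System.out.println(Arrays.toString(fill(fruitPostings(), 5, Pick.FIRST_TERM)));
        // prints [null, apple, banana, apple, apple]
    }
}
```

The null check in the FIRST_TERM branch is the only extra work per posting in this sketch;
whether that holds for the real caches (which fill primitive arrays for some types) is not
shown here.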

> Proposal: introduce more sensible sorting when a doc has multiple values for a term
> -----------------------------------------------------------------------------------
>
>                 Key: LUCENE-1372
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1372
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.2
>            Reporter: Paul Cowan
>            Priority: Minor
>         Attachments: LUCENE-1372-MultiValueSorters.patch, lucene-multisort.patch
>
>
> At the moment, FieldCacheImpl has somewhat disconcerting values when sorting on a field
> for which multiple values exist for one document. For example, imagine a field "fruit" which
> is added to a document multiple times, with the values as follows:
> doc 1: {"apple"}
> doc 2: {"banana"}
> doc 3: {"apple", "banana"}
> doc 4: {"apple", "zebra"}
> if one sorts on the field "fruit", the loop in FieldCacheImpl.stringsIndexCache.createValue()
> (and similarly for the other methods in the various FieldCacheImpl caches) does the following:
>           while (termDocs.next()) {
>             retArray[termDocs.doc()] = t;
>           }
> which means that we loop over the terms in their natural order and, on each one, overwrite
> retArray[doc] with the value for each document with that term. Effectively, this overwriting
> means that a string sort in this circumstance will sort by the LAST term lexicographically,
> so the docs above will effectively be sorted as if they had the single values ("apple", "banana",
> "banana", "zebra"), which is nonintuitive. To change this to sort on the first term in the
> TermEnum seems relatively trivial and low-overhead; while it's not perfect (it's not locale-aware,
> for example) the behaviour seems much more sensible to me. Interested to see what people think.
> Patch to follow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

