lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Is docvalue sorted by value?
Date Mon, 05 Mar 2018 22:49:22 GMT
I think there are two issues here that are being conflated.

1> _within_ a document, i.e. for a multi-valued field, the values are
stored, as Dominik says, as a SORTED_SET. Not only will they be returned
(if you return from docValues rather than stored) in lexical order,
but identical values will be collapsed.
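
As a rough illustration of point 1>, the per-document behaviour can be
modelled in plain Java with a TreeSet. This is only a sketch of the
semantics, not Lucene's API or on-disk format; the names PerDocSortedSet
and valuesFor are made up for the example, and String's natural order
stands in for Lucene's byte-wise term order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Conceptual model only: mimics what a reader sees for one document's
// multi-valued SORTED_SET field. This is not Lucene code.
public class PerDocSortedSet {
    // Returns the values the way doc values would hand them back:
    // sorted, with identical values collapsed into one.
    public static List<String> valuesFor(List<String> indexedValues) {
        return new ArrayList<>(new TreeSet<>(indexedValues));
    }

    public static void main(String[] args) {
        // Values in the order they were added to the document.
        List<String> indexed = Arrays.asList("pear", "apple", "pear", "banana");
        // Insertion order is lost and the duplicate "pear" is gone.
        System.out.println(valuesFor(indexed));
    }
}
```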

2> across multiple documents, the question about  "...persisted with
order of values, not document id..." really makes no sense. The point
of DocValues is to answer the question "for document X what is the
value of field Y". X here is the _internal_ document ID. Now consider
a search. There are two documents that are hits, doc 35 and doc 198
(internal Lucene doc ID). To sort them by field Y you have to know
what the value in that field is for those two docs. How would
"pre-ordering" the values help here? If I have the _values_ in order,
I have no clue what docs are associated with them. That question is
what the "inverted index" is there to answer.
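
To make the two directions concrete, here is a small plain-Java sketch
that builds a value -> doc ids map from a doc id -> value array. It is a
toy model under my own assumptions (the class name TwoDirections and the
method invert are hypothetical), not how Lucene actually lays out
postings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: contrasts the "doc -> value" direction (doc values)
// with the "value -> docs" direction (inverted index).
public class TwoDirections {
    // Builds a value-ordered map from each value to the doc ids holding it.
    public static Map<Long, List<Integer>> invert(long[] valueByDocId) {
        Map<Long, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < valueByDocId.length; docId++) {
            postings.computeIfAbsent(valueByDocId[docId], v -> new ArrayList<>()).add(docId);
        }
        return postings;
    }

    public static void main(String[] args) {
        long[] valueByDocId = {5L, 3L, 5L};       // doc 0 -> 5, doc 1 -> 3, doc 2 -> 5
        // Keys come out in value order, each with its list of doc ids.
        System.out.println(invert(valueByDocId));
    }
}
```

The array answers "what is the value for doc X" in one lookup; the map
answers "which docs have value V". A bare list of sorted values, with no
doc ids attached, answers neither.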

So I have doc 35 and 198. Think of DocValues as a large array indexed
by internal doc id. To know how these two docs sort all I have to do
is index into the array. It's slightly more complicated than that, but
conceptually that's what happens.
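
A minimal sketch of that mental model, in plain Java rather than Lucene
code: the column is an array indexed by internal doc id, and sorting the
hits is just a lookup per hit. The class DocValuesColumn, the method
sortHits and the field values 42 and 7 are all invented for the example.

```java
import java.util.Arrays;
import java.util.Comparator;

// Conceptual model only: a doc-values column as an array indexed by
// internal doc id. This is not Lucene's actual data structure.
public class DocValuesColumn {
    // Orders the hit doc ids by field value, looking each value up by doc id.
    public static int[] sortHits(long[] valueByDocId, int[] hits) {
        return Arrays.stream(hits)
                .boxed()
                .sorted(Comparator.comparingLong((Integer docId) -> valueByDocId[docId]))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    public static void main(String[] args) {
        long[] fieldY = new long[200];  // one slot per document (illustrative size)
        fieldY[35] = 42L;               // made-up value of field Y for doc 35
        fieldY[198] = 7L;               // made-up value of field Y for doc 198
        int[] hits = {35, 198};         // doc ids matched by the search
        System.out.println(Arrays.toString(sortHits(fieldY, hits))); // prints [198, 35]
    }
}
```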

Best,
Erick

On Mon, Mar 5, 2018 at 11:29 AM, Dominik Safaric
<dominiksafaric@gmail.com> wrote:
>> So, can doc values be persisted in order of value, not document id? Sorting should
>> then be fast because the values are pre-ordered instead of scanned and sorted at runtime.
>
>
> No, unfortunately doc values cannot be persisted in value order. Lucene stores these values
> internally as a DocValuesType.SORTED_SET, where the values are kept sorted, using for example
> Long.compareTo().
>
> If you'd like to retrieve the values in insertion order, use stored fields instead of doc
> values. You can then access the values in order using the LeafReader's document function.
> However, beware that this may cause performance issues because it requires loading the
> document from disk.
>
> If you need to store and retrieve multiple numeric values per document in order, you
> might consider using PointValues. PointValues are indexed internally with KD-trees. But beware
> that PointValues have limited dimensionality: you can, for example, store values
> in up to 8 dimensions, each of at most 16 bytes.
>
>> On 5 Mar 2018, at 15:33, Tony Ma <tma@opentext.com> wrote:
>>
>> Per my understanding, doc values (binary doc values / numeric doc values) are stored
>> in document id order. Sorted numeric doc values just means that if a document has multiple
>> values, the values are sorted within that document, but across documents the values
>> are still ordered by document id. Is that true?
>> So, can doc values be persisted in order of value, not document id? Sorting should
>> then be fast because the values are pre-ordered instead of scanned and sorted at runtime.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
