lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] [Updated] (LUCENE-7407) Explore switching doc values to an iterator API
Date Sat, 17 Sep 2016 11:02:20 GMT


Michael McCandless updated LUCENE-7407:
    Attachment: LUCENE-7407.patch

I think this is nearly ready!  I've fixed all nocommits, but {{ant precommit}} is still a bit
angry... I'll fix that before pushing.

I'm attaching a patch that applies against current master.

All doc values usage has been switched to iterators instead of random access, and {{getDocsWithField}}
is gone.
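
As a rough illustration of the difference (a toy sketch, not code from the patch; {{NumericIterator}}, {{sparse}} and {{dump}} are hypothetical names, and the sentinel is an assumption), a forward-only iterator only ever visits the documents that have a value, so a stored 0 is no longer confused with "missing":

```java
import java.util.ArrayList;
import java.util.List;

/** Toy sketch; every name here is hypothetical, not a class from the patch. */
class SparseDocValuesSketch {

    /** Sentinel returned once the iterator is exhausted (an assumption of this sketch). */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Forward-only view over the documents that have a value for one field. */
    interface NumericIterator {
        int nextDoc();     // next doc id that has a value, or NO_MORE_DOCS
        long longValue();  // value of the doc the iterator is currently on
    }

    /** Stores only the docs that have a value: parallel arrays, doc ids sorted ascending. */
    static NumericIterator sparse(int[] docs, long[] values) {
        return new NumericIterator() {
            private int idx = -1;

            public int nextDoc() {
                idx++;
                return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
            }

            public long longValue() {
                return values[idx];
            }
        };
    }

    /** Consumes the iterator once, front to back, recording "doc=value" pairs. */
    static List<String> dump(NumericIterator it) {
        List<String> out = new ArrayList<>();
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            out.add(doc + "=" + it.longValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Only docs 1, 4 and 9 have a value; doc 4 really stores 0.
        NumericIterator it = sparse(new int[] {1, 4, 9}, new long[] {10L, 0L, 7L});
        System.out.println(dump(it)); // prints [1=10, 4=0, 9=7]
    }
}
```

In this sketch there is nothing to ask about documents without a value; they simply never show up in the iteration.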

I've done very little to improve the default codec to take advantage of this.  I think there
are a lot of fun improvements we can make here, in follow-on issues, so that e.g. LUCENE-7253
(merging of sparse doc values fields) is fixed.

To write doc values we now pass a {{DocValuesProducer}} (instead of N Iterables), and I created
legacy deprecated bridge classes ({{LegacyDocValuesIterables}}) to turn these back into Iterables
for existing codecs.
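
To make the idea of that bridge concrete, here is a minimal sketch. It is not the code in {{LegacyDocValuesIterables}}; all names are hypothetical, and it assumes the legacy consumer expects one entry per document, with {{null}} for documents that have no value. A {{Supplier}} stands in for the producer so that each pass over the Iterable can pull a fresh iterator:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

/** Toy sketch; every name here is hypothetical, not a class from the patch. */
class IterableBridgeSketch {

    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Forward-only iterator over the docs that have a value. */
    interface NumericIterator {
        int nextDoc();
        long longValue();
    }

    /** Small in-memory iterator over sorted doc ids and their values. */
    static NumericIterator sparse(int[] docs, long[] values) {
        return new NumericIterator() {
            private int idx = -1;
            public int nextDoc() { idx++; return idx < docs.length ? docs[idx] : NO_MORE_DOCS; }
            public long longValue() { return values[idx]; }
        };
    }

    /**
     * Turns an iterator source back into a dense Iterable with one entry per
     * document (null where the doc has no value). Each call to iterator()
     * asks the producer for a fresh forward-only iterator.
     */
    static Iterable<Long> toIterable(Supplier<NumericIterator> producer, int maxDoc) {
        return () -> new Iterator<Long>() {
            private final NumericIterator it = producer.get();
            private int nextWithValue = it.nextDoc(); // next doc id that has a value
            private int doc = 0;                      // doc id the Iterable is about to emit

            public boolean hasNext() {
                return doc < maxDoc;
            }

            public Long next() {
                if (doc >= maxDoc) {
                    throw new NoSuchElementException();
                }
                Long result = null;
                if (doc == nextWithValue) {
                    result = it.longValue();
                    nextWithValue = it.nextDoc();
                }
                doc++;
                return result;
            }
        };
    }

    /** Copies an Iterable into a list so it is easy to print or compare. */
    static List<Long> toList(Iterable<Long> iterable) {
        List<Long> out = new ArrayList<>();
        for (Long v : iterable) {
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        Iterable<Long> dense =
            toIterable(() -> sparse(new int[] {1, 3}, new long[] {10L, 30L}), 4);
        System.out.println(toList(dense)); // prints [null, 10, null, 30]
        System.out.println(toList(dense)); // second pass works: a fresh iterator is pulled
    }
}
```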

I also created legacy bridge classes to turn random access DVs into the new iterators.
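
The opposite direction can be sketched in the same spirit. This is only an assumption about the general shape of such a bridge, not the patch's implementation; {{RandomAccessValues}}, {{DocsWithField}} and {{wrap}} are hypothetical names. The wrapper walks doc ids in order and skips the ones the legacy "docs with field" bits say are missing:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy sketch; every name here is hypothetical, not a class from the patch. */
class LegacyBridgeSketch {

    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Old-style random access: any doc id can be asked for at any time. */
    interface RandomAccessValues {
        long get(int docID);
    }

    /** Old-style companion telling whether a doc actually has a value. */
    interface DocsWithField {
        boolean get(int docID);
    }

    /** New-style forward-only iterator. */
    interface NumericIterator {
        int nextDoc();
        long longValue();
    }

    /** Exposes random-access values as a forward-only iterator. */
    static NumericIterator wrap(RandomAccessValues values, DocsWithField docsWithField, int maxDoc) {
        return new NumericIterator() {
            private int doc = -1;

            public int nextDoc() {
                if (doc == NO_MORE_DOCS) {
                    return doc; // already exhausted
                }
                // Advance to the next doc id that has a value.
                do {
                    doc++;
                } while (doc < maxDoc && !docsWithField.get(doc));
                if (doc >= maxDoc) {
                    doc = NO_MORE_DOCS;
                }
                return doc;
            }

            public long longValue() {
                return values.get(doc);
            }
        };
    }

    /** Consumes the iterator once, recording "doc=value" pairs. */
    static List<String> dump(NumericIterator it) {
        List<String> out = new ArrayList<>();
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            out.add(doc + "=" + it.longValue());
        }
        return out;
    }

    public static void main(String[] args) {
        long[] legacy = {5L, 0L, 0L, 8L};         // doc 1 has no value, doc 2 stores a real 0
        boolean[] has = {true, false, true, true};
        NumericIterator it = wrap(d -> legacy[d], d -> has[d], legacy.length);
        System.out.println(dump(it)); // prints [0=5, 2=0, 3=8]
    }
}
```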

> Explore switching doc values to an iterator API
> -----------------------------------------------
>                 Key: LUCENE-7407
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>              Labels: docValues
>         Attachments: LUCENE-7407.patch
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
>     you actually use", like postings, which is a compelling
>     reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
>     of doc values, even in the non-sparse case, since the read-time
>     API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
>     implicit in the iteration, and the awkward "return 0 if the
>     document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
>     {{CodecReader}}, and close the trappy "I accidentally shared a
>     single XXXDocValues instance across threads", since an iterator is
>     inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
>     postings over time, since the two problems ("iterate over doc ids
>     and store something interesting for each") are very similar.
> This idea has come up many times in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecated existing random-access APIs.

This message was sent by Atlassian JIRA
