lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7407) Explore switching doc values to an iterator API
Date Tue, 18 Oct 2016 11:19:58 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15585202#comment-15585202 ]

Yonik Seeley commented on LUCENE-7407:
--------------------------------------

bq. Hmm, in Lucene's nightly perf tests, the TermDateFacets got only a bit slower (~11%),
not 40% slower. Yonik, can you give more details on your benchmark so others can run it?

Amdahl's law? My tests are probably just isolating the docValues performance more.  These
are full-stack tests (on both sides?)... so it may be that TermDateFacets spends less of its
execution time actually retrieving docValues, and has more bottlenecks elsewhere.  I'm also
effectively cutting out the query portion (finding the root domain) by reusing the same base
query each time (thus it will be cached).
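
To make the Amdahl's law point concrete, here is a rough worked relation. The symbols p (fraction of baseline time spent retrieving docValues) and s (slowdown factor of just that portion) are introduced here for illustration, and the numbers are assumptions, not measurements: the 40% figure is treated as if it were the pure docValues slowdown, which is only approximately true of an isolating benchmark.

```latex
% R = overall time ratio (new / old) for a full-stack test
R = (1 - p) + p \cdot s
\quad\Longrightarrow\quad
R - 1 = p\,(s - 1)

% Illustrative only: if s = 1.40 and the full-stack test shows R = 1.11,
% then the implied docValues share of the baseline time is
p = \frac{R - 1}{s - 1} = \frac{0.11}{0.40} \approx 0.275
```

Under those assumptions, a benchmark that spends only about a quarter of its time in docValues would report roughly an 11% slowdown even though the docValues path itself is 40% slower.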

Actually, if I test a field with a cardinality of 1M, the performance drop is on the order
of 12% for me too.  The biggest contributor is most likely a higher cost to find the top N
entries (the count array will have 1M entries) that is unrelated to the docvalues implementation.
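
A minimal sketch of why the top-N step scales with cardinality no matter how the counts were gathered. This is not Solr's actual faceting code; the class and method names are hypothetical, and a bounded min-heap is just one common way to do the selection.

```java
import java.util.PriorityQueue;

// Hypothetical illustration, not Solr or Lucene code.
final class TopNCounts {
    /** Returns the ordinals of the n largest counts, highest count first. */
    static int[] topN(int[] counts, int n) {
        // Min-heap ordered by count, so the weakest kept candidate sits at the head.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
        // Every slot of the count array is visited: with a cardinality of 1M
        // this loop runs 1M times regardless of the docvalues implementation.
        for (int ord = 0; ord < counts.length; ord++) {
            if (heap.size() < n) {
                heap.add(new int[] {ord, counts[ord]});
            } else if (n > 0 && counts[ord] > heap.peek()[1]) {
                heap.poll();
                heap.add(new int[] {ord, counts[ord]});
            }
        }
        // Drain the heap from smallest to largest, filling the result back to front.
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll()[0];
        }
        return result;
    }
}
```

The scan is linear in the number of unique values, so this fixed cost takes up a larger share of the request as cardinality grows and dilutes any difference in docvalues retrieval time.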

As far as replicating some of these results... I think most of the relevant details (including
what the exact queries look like) are in SOLR-9599.
Probably one of the simplest to replicate at the lucene level is a sorting test:
{code}
http://localhost:8983/solr/collection1/query?q=*:*%20mydate_dt:NOW&fl=id&sort=s10_s%20desc,%20s100_s%20desc,%20s1000_s%20desc
{code}
So basically, do a really inexpensive query that covers pretty much all of the index, and
sorts by 3 fields (a field with a cardinality of 10, followed by a tiebreak with cardinality
100, followed by a tiebreak with cardinality 1000).  That helps isolate sorting-by-docvalue
performance.  I quickly tested this by hand, and it was 50% slower (I just ran it multiple
times and noted the lowest stable times).
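
For anyone who wants to script the "run it multiple times and note the lowest time" approach rather than do it by hand, here is a minimal sketch. The class and method names are hypothetical, and the task passed in (for example, issuing the request above, or running the equivalent three-field sort at the Lucene level) is left to the caller.

```java
// Hypothetical helper, not part of Lucene or Solr.
final class LowestTime {
    /** Runs the task warmup times unmeasured, then returns the fastest of the measured runs in nanoseconds. */
    static long lowestNanos(Runnable task, int warmup, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        for (int i = 0; i < warmup; i++) {
            task.run(); // let caches and the JIT settle before measuring
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    /** Percentage by which the candidate time exceeds the baseline time. */
    static double percentSlower(long baselineNanos, long candidateNanos) {
        return 100.0 * (candidateNanos - baselineNanos) / baselineNanos;
    }
}
```

Taking the minimum rather than the mean filters out GC pauses and other noise, which is the point of noting the lowest stable time; comparing the minimum before and after the change with `percentSlower` gives the kind of figure quoted above.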

bq. That's the wrong tradeoff, and we shouldn't let performance mess up our APIs that heavily.
Subjectively, I would choose the other trade-off as it's our job to be fast.  The previous
API wasn't bad... it just needed help with sparse values.


> Explore switching doc values to an iterator API
> -----------------------------------------------
>
>                 Key: LUCENE-7407
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7407
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>              Labels: docValues
>             Fix For: master (7.0)
>
>         Attachments: LUCENE-7407.patch
>
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
>     you actually use", like postings, which is a compelling
>     reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
>     of doc values, even in the non-sparse case, since the read-time
>     API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
>     implicit in the iteration, and the awkward "return 0 if the
>     document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
>     {{CodecReader}}, and close the trappy "I accidentally shared a
>     single XXXDocValues instance across threads", since an iterator is
>     inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
>     postings over time, since the two problems ("iterate over doc ids
>     and store something interesting for each") are very similar.
> This idea has come up many times in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecated existing random-access APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
