lucene-issues mailing list archives

From "Greg Miller (Jira)" <>
Subject [jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings
Date Sat, 31 Jul 2021 00:43:00 GMT


Greg Miller commented on LUCENE-10033:

{quote}I like this hybrid idea! Maybe another idea (which is hybrid too!) would consist of
doing part of the decoding for the entire block and part of the decoding on a per-value basis.
E.g. in the case of GCD compression, maybe we could do the bit unpacking for the entire block,
but only apply the multiplicative factor when fetching a single value.{quote}

Oh yeah, that's a really interesting way to approach it as well! I wonder how it would perform
to always delay the application of GCD and/or the min delta addition. It would be nice if
no heuristics were needed to decide whether or not to apply for the whole block up-front vs.
delay, but I think this code auto-vectorizes when applying across the whole block? So there
may be some inefficiency in the "dense" cases if done on-demand. Hmm... cool idea!
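A minimal sketch of the delayed-application idea being discussed, assuming the block has already been bit-unpacked eagerly (the part that can auto-vectorize) while the GCD multiplier and min-delta offset are applied lazily per value. The class and field names here are illustrative, not Lucene's actual DirectReader/doc-values API:

```java
// Sketch: eager bit-unpacking for the whole block, lazy GCD/min-delta
// application per fetched value. Names are hypothetical, not Lucene's API.
public class LazyGcdBlock {
    private final long[] unpacked; // whole block, already bit-unpacked
    private final long gcd;        // multiplicative factor factored out at write time
    private final long minValue;   // minimum value subtracted at write time

    public LazyGcdBlock(long[] unpacked, long gcd, long minValue) {
        this.unpacked = unpacked;
        this.gcd = gcd;
        this.minValue = minValue;
    }

    // Applied per value on demand, so sparse access within a block avoids
    // paying O(blockSize) work to reconstruct values that are never read.
    public long get(int index) {
        return unpacked[index] * gcd + minValue;
    }
}
```

Under this split, the dense case pays one multiply and one add per fetched value instead of amortizing them across the block, which is the potential inefficiency mentioned above.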

{quote}I've tried to push some more optimizations by reusing ForUtil for low numbers of bits
per value and only subtracting the min value when it would likely save space ...{quote}

I'll see if I can pull your latest code down early next week and re-run our internal benchmarks
to see if there's much of a difference.

> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>                 Key: LUCENE-10033
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 40m
>  Remaining Estimate: 0h
> This is a follow-up to the discussion on this thread:
> Our current approach for doc values uses large blocks of 16k values where values can
be decompressed independently, using DirectWriter/DirectReader. This is a bit inefficient
in some cases, e.g. a single outlier can grow the number of bits per value for the entire
block, we can't easily use run-length compression, etc. Plus, it encourages using a different
sub-class for every compression technique, which puts pressure on the JVM.
> We'd like to move to an approach more similar to postings, with smaller
> blocks (e.g. 128 values) whose values all get decompressed at once (using SIMD instructions),
> with skip data within blocks in order to efficiently skip to arbitrary doc IDs (or maybe still
> use jump tables as in today's doc values, and as discussed here for postings:
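To make the outlier effect described in the issue concrete, here is a rough Java sketch (hypothetical helper code, not Lucene's implementation) comparing the packed size of one 16k-value block against 128-value blocks when a single large value is present. The values and block sizes are made-up illustrations:

```java
// Hypothetical illustration of the outlier problem: with per-block bit
// packing, one large value raises the bit width for every value in its block.
public class OutlierCost {
    // Bits needed to represent values in [0, max] with plain bit packing.
    static int bitsRequired(long max) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(max));
    }

    // Total packed size in bits when values are split into fixed-size blocks,
    // each block using the bit width of its own largest value.
    static long packedSizeInBits(long[] values, int blockSize) {
        long total = 0;
        for (int start = 0; start < values.length; start += blockSize) {
            int end = Math.min(start + blockSize, values.length);
            long max = 0;
            for (int i = start; i < end; i++) {
                max = Math.max(max, values[i]);
            }
            total += (long) bitsRequired(max) * (end - start);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] values = new long[16384];
        java.util.Arrays.fill(values, 3); // 2 bits each
        values[7] = 1_000_000;            // one outlier needing 20 bits
        // One 16k block pays 20 bits for every value; 128-value blocks
        // confine the outlier's cost to a single block.
        System.out.println(packedSizeInBits(values, 16384)); // 327680
        System.out.println(packedSizeInBits(values, 128));   // 35072
    }
}
```

In this made-up example the single outlier makes the one-big-block encoding roughly 9x larger than the 128-value-block encoding, which is the motivation for the smaller blocks proposed above.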

This message was sent by Atlassian Jira
