lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: DocValues space usage
Date Tue, 09 Apr 2013 15:51:24 GMT
On Tue, Apr 9, 2013 at 8:22 AM, Wei Wang <welshwang@gmail.com> wrote:

> DocValues makes fast per-doc value lookup possible, which is nice. But it
> brings other interesting issues.
>
> Assume there are 100M docs and 200 NumericDocValuesFields; this ends up
> with huge disk and memory usage, even if there are only thousands of
> values for each field. I guess this is because Lucene stores a value for
> each DocValues field of each document, with a variable-length codec.
>
> So in such a scenario, is it possible to only store values for the
> DocValues fields of the documents that actually have a value for that
> field? Or does Lucene have a column storage mechanism, sort of like a
> hash map, for DocValues?
>

This really depends on the details of the Codec's encoding. So if it's
important to you, the easiest way is to write a codec that compresses things
the way you want (e.g. one that uses a two-stage table, block RLE, or
something like that). Maybe this would be a good contribution to add to the
codecs/ module of Lucene so other people could use it too.
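As a rough illustration of what such a sparse encoding buys you (this is a
standalone sketch of the idea only, not Lucene's DocValuesFormat API; the
class and method names here are made up):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a sparse numeric column that only stores entries
// for documents that actually have a value, instead of one slot per doc.
// A real solution would do this on disk via a custom DocValuesFormat.
public class SparseNumericColumn {
    private final Map<Integer, Long> values = new HashMap<>();
    private final long missingValue; // returned for docs with no value

    public SparseNumericColumn(long missingValue) {
        this.missingValue = missingValue;
    }

    public void set(int docId, long value) {
        values.put(docId, value);
    }

    // Lookup is O(1) on average; memory is proportional to the number of
    // documents that actually have a value, not to maxDoc.
    public long get(int docId) {
        return values.getOrDefault(docId, missingValue);
    }

    public int cardinality() {
        return values.size();
    }
}
```

With 100M docs and only thousands of values per field, a structure like
this stores thousands of entries per field rather than 100M slots, which
is the trade the original question is asking about.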

I think some of the codecs might do this for some types: maybe just as an
accidental side effect of their current compression/encoding (e.g. use of
BlockPackedWriter). But it's not something we really optimize for, as it
doesn't make sense for a lot of docvalues use cases like scoring factors or
faceting. For example, if you want to facet on tons of sparse fields, it's
probably better to use Lucene's faceting module, which uses one combined
docvalues field for the document... I think.
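To sketch the "one combined field" idea (a conceptual illustration only,
not how the faceting module is actually implemented; the separator and
names below are made up): instead of 200 sparse columns, each document
carries a single list of dimension/value pairs, so it only pays for the
fields it actually has.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of packing many sparse facet fields into a single
// per-document value list, so a document never allocates a slot in 200
// separate columns for fields it doesn't use.
public class CombinedFacetField {
    // One list of "dim\u001Fvalue" entries per doc (separator is arbitrary).
    private final Map<Integer, List<String>> perDoc = new LinkedHashMap<>();

    public void add(int docId, String dim, String value) {
        perDoc.computeIfAbsent(docId, k -> new ArrayList<>())
              .add(dim + '\u001F' + value);
    }

    // Only the dimensions this doc actually set are stored.
    public List<String> get(int docId) {
        return perDoc.getOrDefault(docId, List.of());
    }
}
```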
