lucene-java-user mailing list archives

From Wei Wang <welshw...@gmail.com>
Subject Re: DocValues space usage
Date Tue, 09 Apr 2013 16:06:59 GMT
Thanks for the hint. Could you point to a Codec that might do this for
some types, even just as a side effect as you mentioned? It would be
helpful to have something to start with.

And could you elaborate a bit more on "facet on tons of sparse
fields"? I only got a vague idea from the comments.
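
To make the question concrete, here is a rough sketch of what "one combined
field for the document" could mean. It assumes every (field, value) pair has
already been mapped to a non-negative integer ordinal, and it packs all of a
document's ordinals into a single byte[] using delta plus variable-length
encoding. The class and method names are hypothetical, and the encoding is
only one plausible choice, not necessarily what the faceting module does.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not a Lucene class: one combined binary value per document. */
final class CombinedOrdinalsSketch {

    /** Packs a document's ordinals (assumed non-negative) into one byte[]. */
    static byte[] encode(int[] ordinals) {
        int[] sorted = ordinals.clone();
        Arrays.sort(sorted);                       // sort so deltas stay small
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sorted) {
            int delta = ord - prev;
            while ((delta & ~0x7F) != 0) {         // more than 7 bits remain
                out.write((delta & 0x7F) | 0x80);  // low 7 bits + continuation flag
                delta >>>= 7;
            }
            out.write(delta);                      // last byte, no continuation flag
            prev = ord;
        }
        return out.toByteArray();
    }

    /** Reverses encode(): returns the ordinals in ascending order. */
    static List<Integer> decode(byte[] bytes) {
        List<Integer> ords = new ArrayList<>();
        int prev = 0;
        int i = 0;
        while (i < bytes.length) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[i++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            ords.add(prev);
        }
        return ords;
    }

    public static void main(String[] args) {
        // A document that has values for only 3 of the 200 sparse fields.
        byte[] packed = encode(new int[] {300, 5, 70000});
        System.out.println(packed.length + " bytes, ordinals " + decode(packed));
    }
}
```

With this layout a document pays only for the fields it actually has, and a
document with no values stores an empty byte[].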

On Tue, Apr 9, 2013 at 8:51 AM, Robert Muir <rcmuir@gmail.com> wrote:

> On Tue, Apr 9, 2013 at 8:22 AM, Wei Wang <welshwang@gmail.com> wrote:
>
> > DocValues makes fast per-document value lookup possible, which is nice.
> > But it raises other interesting issues.
> >
> > Assume there are 100M docs and 200 NumericDocValuesFields. This ends up
> > with huge disk and memory usage, even if there are just thousands of
> > values for each field. I guess this is because Lucene stores a value for
> > each DocValues field of each document, with a variable-length codec.
> >
> > So in such a scenario, is it possible to store values only for the
> > documents that actually have a value for that field? Or does Lucene
> > have a column storage mechanism, something like a hash map, for
> > DocValues?
> >
>
> This really depends on the details of the Codec's encoding. So if it's
> important to you, the easiest way is to write a codec that compresses things
> the way you want (e.g. uses two-stage table/block RLE/something like that).
> Maybe this would be a good contribution to add to the codecs/ module of
> Lucene so other people could use it too.
>
> I think some of the codecs might do this for some types, maybe just as an
> accidental side effect of their current compression/encoding (e.g. use of
> BlockPackedWriter). But it's not something we really optimize for, as it
> doesn't make sense for a lot of use cases like scoring factors or faceting
> for docvalues. For example, if you want to facet on tons of sparse fields,
> it's probably better to use Lucene's faceting module, which uses one combined
> docvalues field for the document... I think.
>

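As a starting point for the "store values only for documents that have one"
idea discussed above, here is a minimal in-memory sketch. It keeps a bitset of
documents that have a value, a small per-block rank table, and a dense array
holding only the present values. The class name, the block size of 1024, and
the "missing" sentinel are all assumptions made for illustration; this is not
a Lucene Codec and uses no Lucene APIs.

```java
import java.util.BitSet;

/** Hypothetical sketch, not a Lucene class: a sparse per-document numeric column. */
final class SparseNumericColumn {
    private static final int BLOCK = 1024;  // docs per rank block (arbitrary choice)

    private final BitSet hasValue;   // bit d is set when document d has a value
    private final int[] blockRank;   // blockRank[b] = number of set bits before block b
    private final long[] values;     // one slot per document that has a value
    private final long missing;      // returned for documents without a value

    /** docIds must be sorted ascending and unique; docValues[i] belongs to docIds[i]. */
    SparseNumericColumn(int maxDoc, int[] docIds, long[] docValues, long missing) {
        this.hasValue = new BitSet(maxDoc);
        for (int d : docIds) {
            hasValue.set(d);
        }
        this.values = docValues.clone();
        this.missing = missing;

        int blocks = (maxDoc + BLOCK - 1) / BLOCK;
        this.blockRank = new int[blocks + 1];
        for (int b = 0; b < blocks; b++) {
            int from = b * BLOCK;
            int to = Math.min(maxDoc, from + BLOCK);
            blockRank[b + 1] = blockRank[b] + hasValue.get(from, to).cardinality();
        }
    }

    /** Returns the value for doc, or the missing sentinel if it has none. */
    long get(int doc) {
        if (!hasValue.get(doc)) {
            return missing;
        }
        int block = doc / BLOCK;
        int start = block * BLOCK;
        // Rank of doc = set bits before its block + set bits inside the block before doc.
        int rank = blockRank[block] + hasValue.get(start, doc).cardinality();
        return values[rank];
    }

    public static void main(String[] args) {
        // 5000 documents, only three of which have a value for this field.
        SparseNumericColumn col = new SparseNumericColumn(
                5000, new int[] {3, 1024, 4999}, new long[] {7L, 8L, 9L}, -1L);
        System.out.println(col.get(3) + " " + col.get(4) + " " + col.get(4999));
    }
}
```

The cost is one bit per document plus one long per present value, rather than
one slot per document per field. A real codec would additionally need to write
these structures to disk and compress the value array.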