lucene-dev mailing list archives

From "Shai Erera (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
Date Thu, 07 Feb 2013 16:15:13 GMT


Shai Erera commented on LUCENE-4609:

That will be interesting to test. To test it fairly, we should either test the decoder
(that's what we usually test) through the abstracted code (i.e. via CategoryListIterator),
or, Gilad, if you can, copy CountingFacetsCollector and inline the decoder code in place
of the dgap+vint code. That would be simpler to test, with the least noise.

bq. I'm not sure SimpleIntEncoder was ever used

Mike and I tested it ... at some point :). I don't remember where we posted the results though,
whether it was in email, GTalk or some issue. But I do remember that the results were worse
than DGapVInt's. We were always surprised by how fast DGapVInt is; all along we thought
VInt was expensive, but it may not be, at least not on the Wikipedia collection.
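For readers unfamiliar with the scheme, here is a minimal sketch of the DGap+VInt idea: the sorted ordinals are turned into deltas (gaps), and each gap is written as a variable-length int, 7 bits per byte with the high bit as a continuation flag. The class and method names here are illustrative, not Lucene's actual IntEncoder/IntDecoder API.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Sketch only: DGap (delta-gap) encoding followed by VInt byte encoding.
// Assumes the input ordinals are sorted and non-negative.
public class DGapVInt {
    static byte[] encode(int[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrds) {
            int gap = ord - prev;  // gaps are much smaller than raw ordinals
            prev = ord;
            while ((gap & ~0x7F) != 0) {      // more than 7 bits remain
                out.write((gap & 0x7F) | 0x80); // low 7 bits + continuation flag
                gap >>>= 7;
            }
            out.write(gap);                    // final byte, high bit clear
        }
        return out.toByteArray();
    }

    static List<Integer> decode(byte[] buf) {
        List<Integer> ords = new ArrayList<>();
        int prev = 0;
        for (int i = 0; i < buf.length; ) {
            int value = 0, shift = 0;
            byte b;
            do {
                b = buf[i++];
                value |= (b & 0x7F) << shift;  // accumulate 7 bits at a time
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += value;                     // undo the delta-gap
            ords.add(prev);
        }
        return ords;
    }
}
```

Small, dense gaps fit in a single byte each, which is part of why this encoding holds up so well in practice.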

bq. the decoding speed is significantly faster

That's good, but Mike and I have already concluded that EncodingSpeed just .. lies :). It's
a micro-benchmark, and while it showed significant improvements after I moved the encoders
to the bulk API, in the real-world scenario it performed worse. I had to inline things and
specialize them even further to beat the previous implementation.

I will be glad if SemiPacked is faster .. but judging from past experience, I'm not getting
my hopes up too high :).

As for this encoding algorithm, it all depends on how many values actually fall into the 256
range. That's another problem with EncodingSpeed -- it uses a real-world scenario of a crazy
application which encoded 2430 ordinals for a single document! You can see that the values
that are encoded are small, e.g. by looking at the NOnes bits/int. I suspect that in real life,
there won't be many values that fall into that range, at least after some documents have been
indexed, because when you have a single category per dimension in a document, there is
little chance that their ordinals will be "close".
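The "closeness" argument above can be made concrete with a tiny check of how many delta-gaps fit in the 0..255 range, which is the case a byte-level semi-packed scheme would presumably benefit from. This helper is purely hypothetical (not part of Lucene or the patch); it assumes sorted, distinct ordinals.

```java
// Illustrative only: fraction of a document's delta-gaps that fit in one byte.
// With one category per dimension, dimensions tend to sit far apart in the
// taxonomy's ordinal space, so most gaps exceed 255.
public class GapStats {
    static double fractionOfSmallGaps(int[] sortedOrds) {
        if (sortedOrds.length == 0) return 0.0;
        int small = 0, prev = 0;
        for (int ord : sortedOrds) {
            if (ord - prev < 256) small++;  // gap fits in a single byte
            prev = ord;
        }
        return (double) small / sortedOrds.length;
    }
}
```

For ordinals like {5, 100000, 200000} (one category in each of three far-apart dimensions), only the first gap is small; for a dense run like {1, 2, 3}, all gaps are.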

But .. we should let luceneutil be the judge of that :). So Gilad, can you make a patch with
a SemiPackedCountingCollector? And also modify the default that FacetCollector.create returns,
so that it's easy to compare base (CountingFC) to comp (SemiPackedCFC). If you want to test
the collector, then run TestDemoFacets (as-is) and CountingFCTest (modifying the collector
it uses) to make sure the Collector works.
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>                 Key: LUCENE-4609
>                 URL:
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/facet
>            Reporter: Shai Erera
>            Priority: Minor
>         Attachments: LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch,
LUCENE-4609.patch, SemiPackedEncoder.patch
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the category
ordinals. We have several such encoders, including VInt (default), and block encoders.
> It would be interesting to implement and benchmark a PackedIntsEncoder/Decoder, with
potentially two variants: (1) one that receives bitsPerValue up front, e.g. when you know
you have a small taxonomy and the max value you can see, and (2) one that decides per doc
on the optimal bitsPerValue and writes it as a header in the byte[] or something.
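The second variant in the description could be sketched roughly as follows: compute bitsPerValue from the doc's max ordinal, write it as a one-byte header, then bit-pack the values. This is purely illustrative (class and method names are made up); Lucene's real PackedInts utilities in oal.util.packed work differently.

```java
// Sketch of variant (2): per-document bit packing with a bitsPerValue header.
// Assumes non-negative ordinals; the value count is passed to decode here,
// though a real format would store it too.
public class PerDocPacked {
    static byte[] encode(int[] ords) {
        int max = 0;
        for (int ord : ords) max = Math.max(max, ord);
        int bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
        byte[] out = new byte[1 + (ords.length * bits + 7) / 8];
        out[0] = (byte) bits;                       // header: bitsPerValue
        int bitPos = 0;
        for (int ord : ords) {
            for (int b = 0; b < bits; b++, bitPos++) {
                if (((ord >>> b) & 1) != 0) {
                    out[1 + (bitPos >> 3)] |= 1 << (bitPos & 7);
                }
            }
        }
        return out;
    }

    static int[] decode(byte[] buf, int count) {
        int bits = buf[0];                          // read the header back
        int[] ords = new int[count];
        int bitPos = 0;
        for (int i = 0; i < count; i++) {
            int v = 0;
            for (int b = 0; b < bits; b++, bitPos++) {
                if ((buf[1 + (bitPos >> 3)] & (1 << (bitPos & 7))) != 0) {
                    v |= 1 << b;
                }
            }
            ords[i] = v;
        }
        return ords;
    }
}
```

With a max ordinal of 100, each value takes 7 bits instead of a full VInt byte or more, which is where the per-doc header could pay off.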

