lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gilad Barkai (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets
Date Thu, 07 Feb 2013 15:31:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573590#comment-13573590
] 

Gilad Barkai edited comment on LUCENE-4609 at 2/7/13 3:30 PM:
--------------------------------------------------------------

Finally figured out I was doing things completely wrong.. instead of having a super smart
optimizing code for semi-packed encoder - there's now a strait forward semi-packed encoder:
Values smaller than 256 use only one byte (the value itself) and larger values are encoded
as VInt plus a leading zero byte. Worst case it can be 6 bytes per value (zero marker + 5
of VInt).

The idea, is to pay the penalty for variable length encoding for large values which should
be less common in a sort-uniq-dgap scenario.

Wrote two versions w and w/o dgap specialization, though I'm not sure how useful is the non-specialized
code.

I do not currently have the means to run the LuceneUtil (nor the wikipedia index with the
categories) - but I ran the {{EncodingSpeed}} test - and was surprised.
While the encoding is on a little worse (or on par) with dgap-vint, the decoding speed is
significantly faster. The new encode is the only (?!) encoder to beat {{SimpleIntEncoder}}
(which writes plain 32 bits per value) in decoding time.

Those values being used in the EncodingSpeed are real scenario, but I'm not sure how much
they represent a "common" case (e.g wikipedia). 

Mike - could you please try this encoder? I guess it only makes sense to run the specialized
{{DGapSemiPackedEncoder}}.
Also, I'm not sure {{SimpleIntEncoder}} was ever used (without any sorting, or unique). It
would be interesting to test it as well. We will pay in more I/O and much larger file size
(~4 times larger..) but it doesn't mean it will be any slower.

Here are the results of the EncodingSpeed test:
{noformat}
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 2430) 41152 times.

Encoder                                                        Bits/Int          Encode Time
               Encode Time          Decode Time                Decode Time
                                                                              [milliseconds]
       [microsecond / int]       [milliseconds]        [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000                  190
                    1.9000                  165                     1.6500
VInt8                                                           18.4955                  436
                    4.3600                  359                     3.5900
Sorting(Unique(VInt8))                                          18.4955                 3557
                   35.5702                  314                     3.1400
Sorting(Unique(DGap(VInt8)))                                     8.5597                 3485
                   34.8502                  270                     2.7000
Sorting(Unique(DGapVInt8))                                       8.5597                 3434
                   34.3402                  192                     1.9200
Sorting(Unique(DGap(SemiPacked)))                                8.6453                 3386
                   33.8602                  156                     1.5600
Sorting(Unique(DGapSemiPacked))                                  8.6453                 3397
                   33.9702                   99                     0.9900
Sorting(Unique(DGap(EightFlags(VInt))))                          4.9679                 4002
                   40.0203                  381                     3.8100
Sorting(Unique(DGap(FourFlags(VInt))))                           4.8198                 3972
                   39.7203                  399                     3.9900
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                4.5794                 4448
                   44.4803                  645                     6.4500
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                4.5794                 4461
                   44.6103                  641                     6.4100


Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 1489) 67159 times.

Encoder                                                        Bits/Int          Encode Time
               Encode Time          Decode Time                Decode Time
                                                                              [milliseconds]
       [microsecond / int]       [milliseconds]        [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000                  169
                    1.6900                  163                     1.6300
VInt8                                                           18.2673                  421
                    4.2100                  374                     3.7400
Sorting(Unique(VInt8))                                          18.2673                 2230
                   22.3001                  337                     3.3700
Sorting(Unique(DGap(VInt8)))                                     8.9456                 2257
                   22.5701                  273                     2.7300
Sorting(Unique(DGapVInt8))                                       8.9456                 2214
                   22.1401                  192                     1.9200
Sorting(Unique(DGap(SemiPacked)))                                9.2787                 2162
                   21.6201                  180                     1.8000
Sorting(Unique(DGapSemiPacked))                                  9.2787                 2148
                   21.4801                  120                     1.2000
Sorting(Unique(DGap(EightFlags(VInt))))                          5.7542                 2937
                   29.3701                  395                     3.9500
Sorting(Unique(DGap(FourFlags(VInt))))                           5.5447                 2768
                   27.6801                  407                     4.0700
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                5.3566                 3294
                   32.9401                  651                     6.5100
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                5.3996                 3318
                   33.1801                  662                     6.6200


Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 10000 (unsorted, length of: 18) 5555555 times.

Encoder                                                        Bits/Int          Encode Time
               Encode Time          Decode Time                Decode Time
                                                                              [milliseconds]
       [microsecond / int]       [milliseconds]        [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000                  316
                    3.1600                  318                     3.1800
VInt8                                                           20.8889                  569
                    5.6900                  496                     4.9600
Sorting(Unique(VInt8))                                          20.8889                 1196
                   11.9600                  488                     4.8800
Sorting(Unique(DGap(VInt8)))                                    12.0000                 1151
                   11.5100                  481                     4.8100
Sorting(Unique(DGapVInt8))                                      12.0000                 1100
                   11.0000                  352                     3.5200
Sorting(Unique(DGap(SemiPacked)))                               14.6667                 1107
                   11.0700                  507                     5.0700
Sorting(Unique(DGapSemiPacked))                                 14.6667                 1037
                   10.3700                  439                     4.3900
Sorting(Unique(DGap(EightFlags(VInt))))                         10.2222                 1315
                   13.1500                  656                     6.5600
Sorting(Unique(DGap(FourFlags(VInt))))                          10.2222                 1408
                   14.0800                  675                     6.7500
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                9.7778                 1617
                   16.1700                  990                     9.9000
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))               10.2222                 1708
                   17.0800                  992                     9.9200


Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 501871 (unsorted, length of: 957) 104493 times.

Encoder                                                        Bits/Int          Encode Time
               Encode Time          Decode Time                Decode Time
                                                                              [milliseconds]
       [microsecond / int]       [milliseconds]        [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000                  181
                    1.8100                  168                     1.6800
VInt8                                                           16.5768                  428
                    4.2800                  351                     3.5100
Sorting(Unique(VInt8))                                          16.5768                 1586
                   15.8600                  330                     3.3000
Sorting(Unique(DGap(VInt8)))                                     8.4848                 1477
                   14.7700                  267                     2.6700
Sorting(Unique(DGapVInt8))                                       8.4848                 1435
                   14.3500                  193                     1.9300
Sorting(Unique(DGap(SemiPacked)))                                8.6771                 1424
                   14.2400                  159                     1.5900
Sorting(Unique(DGapSemiPacked))                                  8.6771                 1402
                   14.0200                  100                     1.0000
Sorting(Unique(DGap(EightFlags(VInt))))                          4.4138                 1603
                   16.0300                  376                     3.7600
Sorting(Unique(DGap(FourFlags(VInt))))                           4.1797                 1700
                   17.0000                  382                     3.8200
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                3.8955                 2011
                   20.1100                  602                     6.0200
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                3.8871                 2024
                   20.2400                  594                     5.9400


{noformat}


I have another version which uses two bytes before using the variable length (e.g values smaller
than 0x10000 are encoded as is in two bytes, other values are encoded as two leading zero
bytes and the VInt) - but I did not optimize it yet, nor I'm sure it's very useful. 
                
      was (Author: gilad):
    Finally figured out I was doing things completely wrong.. instead of having a super smart
optimizing code for semi-packed encoder - there's now a strait forward semi-packed encoder:
Values smaller than 256 use only one byte (the value itself) and larger values are encoded
as VInt plus a leading zero byte. Worst case it can be 6 bytes per value (zero marker + 5
of VInt).

The idea, is to pay the penalty for variable length encoding for large values which should
be less common in a sort-uniq-dgap scenario.

Wrote two versions w and w/o dgap specialization, though I'm not sure how useful is the non-specialized
code.

I do not currently have the means to run the LuceneUtil (nor the wikipedia index with the
categories) - but I ran the {EncodingSpeed} test - and was surprised.
While the encoding is on a little worse (or on par) with dgap-vint, the decoding speed is
significantly faster. The new encode is the only (?!) encoder to beat {SimpleIntEncoder} (which
writes plain 32 bits per value) in decoding time.

Those values being used in the EncodingSpeed are real scenario, but I'm not sure how much
they represent a "common" case (e.g wikipedia). 

Mike - could you please try this encoder? I guess it only makes sense to run the specialized
{DGapSemiPackedEncoder}.
Also, I'm not sure SimpleIntEncoder was ever used (without any sorting, or unique). It would
be interesting to test it as well. We will pay in more I/O and much larger file size (~4 times
larger..) but it doesn't mean it will be any slower.

Here are the results of the EncodingSpeed test:
{noformat}
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 2430) 41152 times.

Encoder                                                        Bits/Int          Encode Time
               Encode Time          Decode Time                Decode Time
                                                                              [milliseconds]
       [microsecond / int]       [milliseconds]        [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000                  190
                    1.9000                  165                     1.6500
VInt8                                                           18.4955                  436
                    4.3600                  359                     3.5900
Sorting(Unique(VInt8))                                          18.4955                 3557
                   35.5702                  314                     3.1400
Sorting(Unique(DGap(VInt8)))                                     8.5597                 3485
                   34.8502                  270                     2.7000
Sorting(Unique(DGapVInt8))                                       8.5597                 3434
                   34.3402                  192                     1.9200
Sorting(Unique(DGap(SemiPacked)))                                8.6453                 3386
                   33.8602                  156                     1.5600
Sorting(Unique(DGapSemiPacked))                                  8.6453                 3397
                   33.9702                   99                     0.9900
Sorting(Unique(DGap(EightFlags(VInt))))                          4.9679                 4002
                   40.0203                  381                     3.8100
Sorting(Unique(DGap(FourFlags(VInt))))                           4.8198                 3972
                   39.7203                  399                     3.9900
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                4.5794                 4448
                   44.4803                  645                     6.4500
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                4.5794                 4461
                   44.6103                  641                     6.4100


Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 1489) 67159 times.

Encoder                                                        Bits/Int          Encode Time
               Encode Time          Decode Time                Decode Time
                                                                              [milliseconds]
       [microsecond / int]       [milliseconds]        [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000                  169
                    1.6900                  163                     1.6300
VInt8                                                           18.2673                  421
                    4.2100                  374                     3.7400
Sorting(Unique(VInt8))                                          18.2673                 2230
                   22.3001                  337                     3.3700
Sorting(Unique(DGap(VInt8)))                                     8.9456                 2257
                   22.5701                  273                     2.7300
Sorting(Unique(DGapVInt8))                                       8.9456                 2214
                   22.1401                  192                     1.9200
Sorting(Unique(DGap(SemiPacked)))                                9.2787                 2162
                   21.6201                  180                     1.8000
Sorting(Unique(DGapSemiPacked))                                  9.2787                 2148
                   21.4801                  120                     1.2000
Sorting(Unique(DGap(EightFlags(VInt))))                          5.7542                 2937
                   29.3701                  395                     3.9500
Sorting(Unique(DGap(FourFlags(VInt))))                           5.5447                 2768
                   27.6801                  407                     4.0700
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                5.3566                 3294
                   32.9401                  651                     6.5100
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                5.3996                 3318
                   33.1801                  662                     6.6200


Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 10000 (unsorted, length of: 18) 5555555 times.

Encoder                                                        Bits/Int          Encode Time
               Encode Time          Decode Time                Decode Time
                                                                              [milliseconds]
       [microsecond / int]       [milliseconds]        [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000                  316
                    3.1600                  318                     3.1800
VInt8                                                           20.8889                  569
                    5.6900                  496                     4.9600
Sorting(Unique(VInt8))                                          20.8889                 1196
                   11.9600                  488                     4.8800
Sorting(Unique(DGap(VInt8)))                                    12.0000                 1151
                   11.5100                  481                     4.8100
Sorting(Unique(DGapVInt8))                                      12.0000                 1100
                   11.0000                  352                     3.5200
Sorting(Unique(DGap(SemiPacked)))                               14.6667                 1107
                   11.0700                  507                     5.0700
Sorting(Unique(DGapSemiPacked))                                 14.6667                 1037
                   10.3700                  439                     4.3900
Sorting(Unique(DGap(EightFlags(VInt))))                         10.2222                 1315
                   13.1500                  656                     6.5600
Sorting(Unique(DGap(FourFlags(VInt))))                          10.2222                 1408
                   14.0800                  675                     6.7500
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                9.7778                 1617
                   16.1700                  990                     9.9000
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))               10.2222                 1708
                   17.0800                  992                     9.9200


Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 501871 (unsorted, length of: 957) 104493 times.

Encoder                                                        Bits/Int          Encode Time
               Encode Time          Decode Time                Decode Time
                                                                              [milliseconds]
       [microsecond / int]       [milliseconds]        [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000                  181
                    1.8100                  168                     1.6800
VInt8                                                           16.5768                  428
                    4.2800                  351                     3.5100
Sorting(Unique(VInt8))                                          16.5768                 1586
                   15.8600                  330                     3.3000
Sorting(Unique(DGap(VInt8)))                                     8.4848                 1477
                   14.7700                  267                     2.6700
Sorting(Unique(DGapVInt8))                                       8.4848                 1435
                   14.3500                  193                     1.9300
Sorting(Unique(DGap(SemiPacked)))                                8.6771                 1424
                   14.2400                  159                     1.5900
Sorting(Unique(DGapSemiPacked))                                  8.6771                 1402
                   14.0200                  100                     1.0000
Sorting(Unique(DGap(EightFlags(VInt))))                          4.4138                 1603
                   16.0300                  376                     3.7600
Sorting(Unique(DGap(FourFlags(VInt))))                           4.1797                 1700
                   17.0000                  382                     3.8200
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                3.8955                 2011
                   20.1100                  602                     6.0200
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                3.8871                 2024
                   20.2400                  594                     5.9400


{noformat}


I have another version which uses two bytes before using the variable length (e.g values smaller
than 0x10000 are encoded as is in two bytes, other values are encoded as two leading zero
bytes and the VInt) - but I did not optimize it yet, nor I'm sure it's very useful. 
                  
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
>                 Key: LUCENE-4609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4609
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/facet
>            Reporter: Shai Erera
>            Priority: Minor
>         Attachments: LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch,
LUCENE-4609.patch, SemiPackedEncoder.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the category
ordinals. We have several such encoders, including VInt (default), and block encoders.
> It would be interesting to implement and benchmark a PackedIntsEncoder/Decoder, with
potentially two variants: (1) receives bitsPerValue up front, when you e.g. know that you
have a small taxonomy and the max value you can see and (2) one that decides for each doc
on the optimal bitsPerValue, writes it as a header in the byte[] or something.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message