lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
Date Wed, 08 Aug 2012 19:38:21 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431324#comment-13431324 ]

Adrien Grand edited comment on LUCENE-3892 at 8/8/12 7:38 PM:
--------------------------------------------------------------

I made some changes to the {{BlockPacked}} codec:
 - encoding and decoding using int[] instead of long[]
 - selection of the format based on a configurable overhead ratio.
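
To make the two changes above concrete, here is a minimal standalone sketch. It is not the patch itself and does not use the Lucene {{PackedInts}} API: the class {{OverheadRatioSketch}} and all of its method names are hypothetical, and the rule "round up to 8, 16 or 32 bits when the extra bits fit in the budget" is an assumption about how an overhead ratio can drive the choice of format. Values are packed into and unpacked from an int[] rather than a long[].

```java
/** Hypothetical illustration only; not the BlockPacked codec. */
final class OverheadRatioSketch {

    /** Bits needed to store any value in [0, maxValue]; at least 1. */
    static int bitsRequired(int maxValue) {
        if (maxValue < 0) {
            throw new IllegalArgumentException("maxValue must be >= 0");
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(maxValue));
    }

    /**
     * Assumed selection rule: allow up to bitsRequired * ratio extra bits per
     * value, and use an aligned width (8, 16, 32) if it fits in that budget.
     */
    static int chooseBitsPerValue(int bitsRequired, float acceptableOverheadRatio) {
        int maxBits = bitsRequired + (int) (bitsRequired * acceptableOverheadRatio);
        for (int aligned : new int[] {8, 16, 32}) {
            if (bitsRequired <= aligned && aligned <= maxBits) {
                return aligned;
            }
        }
        return bitsRequired; // no aligned width is cheap enough: pack tightly
    }

    /** Packs values into an int[]; each value must fit in bitsPerValue bits. */
    static int[] pack(int[] values, int bitsPerValue) {
        long totalBits = (long) values.length * bitsPerValue;
        int[] packed = new int[(int) ((totalBits + 31) / 32)];
        long bitPos = 0;
        for (int v : values) {
            int idx = (int) (bitPos >>> 5);
            int shift = (int) (bitPos & 31);
            packed[idx] |= v << shift;
            if (shift + bitsPerValue > 32) {
                // the value straddles two ints: spill its high bits
                packed[idx + 1] |= v >>> (32 - shift);
            }
            bitPos += bitsPerValue;
        }
        return packed;
    }

    /** Inverse of pack: reads count values of bitsPerValue bits each. */
    static int[] unpack(int[] packed, int count, int bitsPerValue) {
        int mask = bitsPerValue == 32 ? -1 : (1 << bitsPerValue) - 1;
        int[] out = new int[count];
        long bitPos = 0;
        for (int i = 0; i < count; i++) {
            int idx = (int) (bitPos >>> 5);
            int shift = (int) (bitPos & 31);
            int v = packed[idx] >>> shift;
            if (shift + bitsPerValue > 32) {
                v |= packed[idx + 1] << (32 - shift);
            }
            out[i] = v & mask;
            bitPos += bitsPerValue;
        }
        return out;
    }
}
```

Under this assumed rule, a block whose largest value needs 7 bits would be stored with 8 bits per value at a 20% ratio (one extra bit is within budget), while a ratio of 0 keeps the tight 7-bit packing.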

The results are encouraging (using acceptableOverheadRatio = PackedInts.DEFAULT = 20%):
{noformat}
                Task  QPS 3892  StdDev 3892  QPS 3892-packed  StdDev 3892-packed      Pct diff
            PKLookup    256.93         8.89           256.85                7.47   -6% -    6%
           OrHighLow    145.14         9.86           145.14                9.35  -12% -   14%
             Respell    110.26         1.84           110.27                2.01   -3% -    3%
         AndHighHigh    112.97         0.81           113.19                2.17   -2% -    2%
              Fuzzy1    102.15         1.47           102.86                3.13   -3% -    5%
          OrHighHigh     94.56         6.56            95.43                6.35  -11% -   15%
              Fuzzy2     42.49         0.77            42.89                1.43   -4% -    6%
           OrHighMed    175.30        11.34           177.42               10.83  -10% -   14%
          AndHighLow   1925.02        23.92          1952.57               48.68   -2% -    5%
          HighPhrase      8.96         0.41             9.11                0.46   -7% -   11%
            Wildcard    189.79         2.13           193.12                1.57    0% -    3%
        HighSpanNear      6.47         0.15             6.59                0.25   -4% -    8%
             Prefix3    256.67         2.58           262.40                2.84    0% -    4%
             LowTerm   1746.52        52.80          1789.54               54.30   -3% -    8%
            HighTerm    238.70        13.46           245.63               16.60   -9% -   16%
             MedTerm    923.64        38.19           951.18               46.85   -5% -   12%
          AndHighMed    364.46         3.65           377.09               10.03    0% -    7%
              IntNRQ     56.58         1.02            58.84                0.80    0% -    7%
    HighSloppyPhrase     11.73         0.30            12.40                0.62   -2% -   13%
         LowSpanNear     29.64         0.96            32.44                0.98    2% -   16%
         MedSpanNear     22.96         0.72            25.16                0.85    2% -   16%
           MedPhrase     40.99         1.25            45.09                1.24    3% -   16%
     LowSloppyPhrase     37.88         0.99            41.98                1.49    4% -   17%
           LowPhrase     64.40         2.04            71.84                1.41    5% -   17%
     MedSloppyPhrase     42.29         1.16            47.32                1.54    5% -   18%
{noformat}

I hope this will be confirmed on your computers this time. :-)
                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-BlockTermScorer.patch, LUCENE-3892-blockFor&hardcode(base).patch,
> LUCENE-3892-blockFor&packedecoder(comp).patch, LUCENE-3892-blockFor-with-packedints-decoder.patch,
> LUCENE-3892-blockFor-with-packedints-decoder.patch, LUCENE-3892-blockFor-with-packedints.patch,
> LUCENE-3892-bulkVInt.patch, LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892-for&pfor-with-javadoc.patch,
> LUCENE-3892-handle_open_files.patch, LUCENE-3892-pfor-compress-iterate-numbits.patch, LUCENE-3892-pfor-compress-slow-estimate.patch,
> LUCENE-3892_for_byte[].patch, LUCENE-3892_for_int[].patch, LUCENE-3892_for_unfold_method.patch,
> LUCENE-3892_pfor_unfold_method.patch, LUCENE-3892_pulsing_support.patch, LUCENE-3892_settings.patch,
> LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html).
> I think this would make a good GSoC project.
