lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
Date Wed, 20 Jun 2012 11:40:44 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3892:
---------------------------------------

    Attachment: LUCENE-3892-direct-IntBuffer.patch

The For index is 5.2 GB vs 4.9 GB for vInt: not bad to have only about
a 6% increase in index size when using the For postings format (10M
Wikipedia index).

{quote}
Get more direct access to the file as an int[]; eg MMapDir could
expose an IntBuffer from its ByteBuffer (saving the initial copy
into byte[] that we now do). 
{quote}

I tested this by hacking up Billy's For patch to require
MMapDirectory and pull an IntBuffer directly from its ByteBuffer,
which saves the initial copy of the bytes into a byte[].  But,
curiously, it didn't seem to improve things much:

{noformat}
                Task    QPS base StdDev base     QPS for  StdDev for      Pct diff
          AndHighMed       24.32        0.60       14.24        0.41  -44% -  -38%
            PKLookup      131.98        3.09      108.35        1.47  -20% -  -14%
         AndHighHigh        5.36        0.18        4.66        0.02  -16% -   -9%
              Phrase        1.48        0.02        1.33        0.10  -18% -   -2%
        SloppyPhrase        1.40        0.04        1.26        0.03  -13% -   -5%
            SpanNear        1.14        0.01        1.04        0.02  -10% -   -6%
              IntNRQ       12.13        0.70       11.27        0.46  -15% -    2%
             Prefix3       34.51        1.17       34.11        1.28   -8% -    6%
              Fuzzy1       90.63        1.74       89.68        1.46   -4% -    2%
             Respell       77.22        2.62       76.99        1.62   -5% -    5%
            Wildcard       11.84        0.40       12.20        0.37   -3% -    9%
              Fuzzy2       34.34        0.82       36.16        1.08    0% -   11%
      TermBGroup1M1P        4.71        0.11        5.02        0.18    0% -   12%
           OrHighMed        7.87        0.28        8.50        0.55   -2% -   19%
        TermBGroup1M        3.47        0.03        3.78        0.03    7% -   11%
         TermGroup1M        2.96        0.01        3.25        0.03    8% -   11%
          OrHighHigh        3.55        0.12        3.91        0.21    0% -   20%
                Term        9.72        0.28       10.87        0.44    4% -   19%
{noformat}

Maybe, instead, reading into an int[] and decoding from an int array
(hopefully avoiding bounds checks) will be faster than calling
IntBuffer.get for each encoded int...
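
A minimal sketch of the two read paths being compared, to make the
idea concrete. This is not code from the patch: the class and method
names are hypothetical, a direct ByteBuffer stands in for the
MMapDirectory mapping, and the fixed-width bit packing is a simplified
stand-in for the real For encoding. It only shows that both paths
decode the same block; it makes no claim about which is faster.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;
import java.util.Arrays;

/** Illustrative only: hypothetical names, not the LUCENE-3892 patch. */
final class IntBlockDecodeSketch {

  /** Packs values (each must fit in {@code bits} bits, 1..32) into 32-bit words. */
  static int[] pack(int[] values, int bits) {
    int[] packed = new int[(values.length * bits + 31) / 32];
    int bitPos = 0;
    for (int v : values) {
      int idx = bitPos >>> 5;   // word holding the low bits of this value
      int shift = bitPos & 31;  // bit offset inside that word
      packed[idx] |= v << shift;
      if (shift + bits > 32) {  // value spills into the next word
        packed[idx + 1] |= v >>> (32 - shift);
      }
      bitPos += bits;
    }
    return packed;
  }

  /** Path A: absolute IntBuffer.get calls while decoding (what the hack did). */
  static void decodePerGet(IntBuffer in, int bits, int[] out) {
    int mask = bits == 32 ? -1 : (1 << bits) - 1;
    int bitPos = 0;
    for (int i = 0; i < out.length; i++) {
      int idx = bitPos >>> 5;
      int shift = bitPos & 31;
      int v = in.get(idx) >>> shift;
      if (shift + bits > 32) {
        v |= in.get(idx + 1) << (32 - shift);
      }
      out[i] = v & mask;
      bitPos += bits;
    }
  }

  /** Path B: one bulk get into a reusable int[], then decode from the array. */
  static void decodeViaArray(IntBuffer in, int[] scratch, int bits, int[] out) {
    int words = (out.length * bits + 31) / 32;
    IntBuffer view = in.duplicate(); // leave the caller's position untouched
    view.position(0);
    view.get(scratch, 0, words);     // single bulk copy of the encoded block
    int mask = bits == 32 ? -1 : (1 << bits) - 1;
    int bitPos = 0;
    for (int i = 0; i < out.length; i++) {
      int idx = bitPos >>> 5;
      int shift = bitPos & 31;
      int v = scratch[idx] >>> shift;
      if (shift + bits > 32) {
        v |= scratch[idx + 1] << (32 - shift);
      }
      out[i] = v & mask;
      bitPos += bits;
    }
  }

  public static void main(String[] args) {
    int bits = 5;
    int[] values = new int[128];     // block size chosen for illustration
    for (int i = 0; i < values.length; i++) {
      values[i] = (i * 37) % 32;     // every value fits in 5 bits
    }
    int[] packed = pack(values, bits);

    // Stand-in for an mmapped region viewed as ints.
    IntBuffer ints = ByteBuffer.allocateDirect(packed.length * 4)
        .order(ByteOrder.nativeOrder())
        .asIntBuffer();
    ints.put(packed);

    int[] a = new int[values.length];
    int[] b = new int[values.length];
    decodePerGet(ints, bits, a);
    decodeViaArray(ints, new int[packed.length], bits, b);
    System.out.println("per-get matches input: " + Arrays.equals(values, a));
    System.out.println("via-array matches input: " + Arrays.equals(values, b));
  }
}
```

Path B pays for one extra copy per block, but its inner loop only
touches a plain int[], which is the case where the JIT has the best
chance of hoisting or removing bounds checks.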

                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892-direct-IntBuffer.patch, LUCENE-3892_for.patch, LUCENE-3892_pfor.patch,
>                      LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html).
> I think this would make a good GSoC project.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

