lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
Date Mon, 04 Jun 2012 17:01:25 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288675#comment-13288675 ]

Michael McCandless commented on LUCENE-3892:
--------------------------------------------

Excellent!  All tests pass for me with the PFor postings format as
well... this is a great starting point :) One Solr test failed
(ContentStreamTest), but I think it was a false failure.

I did notice the tests seem to run slower, especially certain ones,
e.g. TestJoinUtil.

Still missing a couple license headers (TestMin, TestCompress)...

I ran a quick perf test using
http://code.google.com/a/apache-extras.org/p/luceneutil on a 10M doc
Wikipedia index.

Indexing time is ~18% slower than Lucene40PostingsFormat (1261 sec vs
1071 sec).

But more important are the slower search times:

{noformat}
                Task    QPS base StdDev base    QPS pfor StdDev pfor      Pct diff
              Phrase        8.52        0.50        4.43        0.40  -55% -  -39%
        SloppyPhrase       12.52        0.39        7.87        0.51  -43% -  -30%
          AndHighMed       67.69        2.82       44.22        1.47  -39% -  -29%
            SpanNear        5.19        0.12        3.90        0.28  -31% -  -17%
            PKLookup      112.16        1.71       95.61        1.30  -17% -  -12%
         AndHighHigh       13.22        0.34       11.86        0.72  -17% -   -2%
            Wildcard       46.04        0.37       41.68        4.45  -19% -    1%
              Fuzzy1       50.11        2.03       48.06        1.91  -11% -    3%
           OrHighMed        9.26        0.48        8.90        0.37  -12% -    5%
          OrHighHigh       12.28        0.56       11.83        0.49  -11% -    5%
      TermBGroup1M1P       40.47        1.94       39.88        2.51  -11% -   10%
              Fuzzy2       53.71        2.66       53.01        2.08   -9% -    7%
         TermGroup1M       36.46        1.21       35.99        1.58   -8% -    6%
        TermBGroup1M       55.53        1.99       55.26        2.68   -8% -    8%
             Respell       69.71        4.49       69.73        2.07   -8% -   10%
                Term       94.38        7.62       94.96       12.19  -18% -   23%
             Prefix3       41.63        0.34       42.21        5.82  -13% -   16%
              IntNRQ        7.08        0.15        7.28        1.29  -17% -   23%
{noformat}
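
The message does not say how the Pct diff range is derived. The rows
above are consistent with one plausible reading: the low end compares
(QPS pfor - StdDev pfor) against (QPS base + StdDev base), the high end
compares (QPS pfor + StdDev pfor) against (QPS base - StdDev base), and
both are truncated to whole percent. The sketch below encodes that
assumption; the class and method names are hypothetical and it is not
taken from luceneutil.

```java
// Hypothetical helper, not luceneutil code: reproduces the Pct diff column
// under the assumption that the range is built from mean +/- one stddev.
final class PctDiffSketch {

    /** Returns {worst, best} percentage change of cmp relative to base. */
    static int[] pctRange(double baseQps, double baseStdDev,
                          double cmpQps, double cmpStdDev) {
        // Worst case: candidate one stddev low, baseline one stddev high.
        double baseHigh = baseQps + baseStdDev;
        double worst = 100.0 * ((cmpQps - cmpStdDev) - baseHigh) / baseHigh;
        // Best case: candidate one stddev high, baseline one stddev low.
        double baseLow = baseQps - baseStdDev;
        double best = 100.0 * ((cmpQps + cmpStdDev) - baseLow) / baseLow;
        // The int cast truncates toward zero (an assumption that fits the rows).
        return new int[] {(int) worst, (int) best};
    }

    public static void main(String[] args) {
        int[] phrase = pctRange(8.52, 0.50, 4.43, 0.40);
        System.out.println("Phrase: " + phrase[0] + "% - " + phrase[1] + "%");
    }
}
```

Under this reading, a row whose range straddles zero (Term, Prefix3,
IntNRQ, ...) shows no difference larger than the run-to-run noise.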

The queries that do skipping are quite a bit slower; this makes sense,
since on skip we do a full block decode.  A smaller block size (we use
128 now, right?) should help, I think.
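
To make the skip cost concrete, here is a toy model (not Lucene code;
every name in it is hypothetical). Plain int arrays stand in for the
encoded postings, a binary search over each block's last doc ID stands
in for skip data, and "decoding" a block always materialises all of
its values. It only counts decoded ints, so it ignores the extra skip
entries and per-block headers that smaller blocks would cost.

```java
// Hypothetical model, not Lucene code: plain arrays stand in for encoded postings.
final class BlockSkipSketch {
    private final int[] docIds;     // sorted doc IDs (the "encoded" postings)
    private final int blockSize;
    private final int[] buffer;     // holds the most recently decoded block
    private int bufferedBlock = -1;
    private int bufferedLen = 0;
    long decodedValues = 0;         // total ints materialised by block decodes

    BlockSkipSketch(int[] docIds, int blockSize) {
        this.docIds = docIds;
        this.blockSize = blockSize;
        this.buffer = new int[blockSize];
    }

    private int numBlocks() {
        return (docIds.length + blockSize - 1) / blockSize;
    }

    /** Last doc ID of a block; this is what skip data would record. */
    private int lastDocOfBlock(int block) {
        return docIds[Math.min((block + 1) * blockSize, docIds.length) - 1];
    }

    /** Stand-in for a FOR/PFOR decode: the whole block is always materialised. */
    private void decodeBlock(int block) {
        int start = block * blockSize;
        bufferedLen = Math.min(blockSize, docIds.length - start);
        System.arraycopy(docIds, start, buffer, 0, bufferedLen);
        bufferedBlock = block;
        decodedValues += bufferedLen;
    }

    /** Returns the first doc ID >= target, or Integer.MAX_VALUE if there is none. */
    int advance(int target) {
        int lo = 0, hi = numBlocks();
        while (lo < hi) {                          // first block whose last doc >= target
            int mid = (lo + hi) >>> 1;
            if (lastDocOfBlock(mid) < target) lo = mid + 1; else hi = mid;
        }
        if (lo == numBlocks()) return Integer.MAX_VALUE;
        if (lo != bufferedBlock) decodeBlock(lo);  // full decode, even for one doc
        for (int i = 0; i < bufferedLen; i++) {
            if (buffer[i] >= target) return buffer[i];
        }
        throw new AssertionError("unreachable: the block's last doc is >= target");
    }

    /** Counts decoded ints for sparse skips over 65536 postings (doc IDs 0, 3, 6, ...). */
    static long decodedForSparseSkips(int blockSize) {
        int[] docs = new int[1 << 16];
        for (int i = 0; i < docs.length; i++) docs[i] = 3 * i;
        BlockSkipSketch postings = new BlockSkipSketch(docs, blockSize);
        for (int target = 0; target < 3 * docs.length; target += 3000) {
            postings.advance(target);
        }
        return postings.decodedValues;
    }

    public static void main(String[] args) {
        System.out.println("block=128: " + decodedForSparseSkips(128) + " ints decoded");
        System.out.println("block=32:  " + decodedForSparseSkips(32) + " ints decoded");
    }
}
```

In this model every skip lands in a fresh block, so the number of
decoded ints scales directly with the block size: 8448 for blocks of
128 versus 2112 for blocks of 32.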

It's strange that the non-skipping queries (Term, OrHighMed,
OrHighHigh) don't show any performance gain... maybe we need to
optimize the decode, or it could be that the removal of the bulk API
is hurting us here.

I'm also curious whether the results would improve if we tried a pure
FOR (no patching, so numBits must be set according to the max value in
each block: a larger index, but hopefully a faster decode).
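
For reference, a minimal sketch of what "pure FOR" means here, assuming
non-negative int values and 64-bit words for the packed output. It is
not the code from the patch, and the names are hypothetical. Every
value in a block is stored with the same width, chosen from the largest
value, so decode is a fixed shift-and-mask loop with no exception
handling.

```java
// Hypothetical sketch of frame-of-reference packing, not the patch's code.
final class ForBlockSketch {

    /** Bits needed for the largest value in the block (at least 1). */
    static int bitsRequired(int[] block) {
        int or = 0;
        for (int v : block) {
            if (v < 0) throw new IllegalArgumentException("values must be non-negative");
            or |= v; // OR-ing keeps the highest set bit of the maximum
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    /** Packs every value with exactly numBits bits, back to back. */
    static long[] encode(int[] block, int numBits) {
        long[] packed = new long[(block.length * numBits + 63) / 64];
        for (int i = 0; i < block.length; i++) {
            long bitPos = (long) i * numBits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            packed[word] |= ((long) block[i]) << shift;
            if (shift + numBits > 64) { // value straddles two words
                packed[word + 1] |= ((long) block[i]) >>> (64 - shift);
            }
        }
        return packed;
    }

    /** Unpacks count values of numBits bits each. */
    static int[] decode(long[] packed, int numBits, int count) {
        long mask = (1L << numBits) - 1;
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            long bitPos = (long) i * numBits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + numBits > 64) { // pull the high bits from the next word
                v |= packed[word + 1] << (64 - shift);
            }
            out[i] = (int) (v & mask);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] block = {3, 7, 1, 100, 0, 5};
        int numBits = bitsRequired(block);
        int[] roundTrip = decode(encode(block, numBits), numBits, block.length);
        System.out.println("numBits=" + numBits + " ok="
            + java.util.Arrays.equals(block, roundTrip));
    }
}
```

The trade-off is visible in the example block: the single value 100
forces all six values to 7 bits, where patching would let the other
five use 3 bits and store 100 as an exception.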


                
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-3892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3892
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>
>         Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch, LUCENE-3892_settings.patch
>
>
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html
> ).
> I think this would make a good GSoC project.


