lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3892) Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
Date Tue, 05 Jun 2012 10:06:24 GMT


Michael McCandless commented on LUCENE-3892:

Hi Billy,

bq. Can I get it from a wiki dump instead?

You can download it at

That's ~6.3 GB (compressed) and 28.7 GB (decompressed); it's the 2012/05/02 Wikipedia en export,
filtered to plain text and then broken into 33.3 M ~1 KB sized docs.  I can help you get the
luceneutil env set up...

bq. Indexing time is ~18% slower than Lucene40PostingsFormat (1071 sec vs 1261 sec).

Yes, this is expected; it actually scans every block 33 times to estimate metadata such as numFrameBits
and numExceptions.

OK, in that case I'm surprised it's only ~18% slower!
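
For readers following along, here is a rough sketch of why such an estimate costs
33 passes per block. It assumes (this is a guess, not taken from the patch) that each
pass tries one candidate bit width from 0 to 32, counts the values that do not fit
(the exceptions), and keeps the cheapest width. The class name, the Estimate holder,
and the cost model (every exception stored as a full 32-bit int) are all hypothetical;
only numFrameBits and numExceptions are names from the thread.

```java
import java.util.Arrays;

public class PforEstimateSketch {

    /** Outcome of the scan: chosen bit width, values that overflow it, estimated size. */
    static final class Estimate {
        final int numFrameBits;
        final int numExceptions;
        final long estimatedBits;

        Estimate(int numFrameBits, int numExceptions, long estimatedBits) {
            this.numFrameBits = numFrameBits;
            this.numExceptions = numExceptions;
            this.estimatedBits = estimatedBits;
        }
    }

    // Hypothetical cost model: each exception is stored as a full 32-bit int.
    static final int EXCEPTION_BITS = 32;

    /** One full pass over the block: counts values needing more than numFrameBits bits. */
    static int countExceptions(int[] block, int numFrameBits) {
        int count = 0;
        for (int v : block) {
            int bitsNeeded = 32 - Integer.numberOfLeadingZeros(v);
            if (bitsNeeded > numFrameBits) {
                count++;
            }
        }
        return count;
    }

    /** Tries all 33 candidate widths (0..32) and keeps the cheapest one. */
    static Estimate estimate(int[] block) {
        Estimate best = null;
        for (int bits = 0; bits <= 32; bits++) {
            int exceptions = countExceptions(block, bits);
            long cost = (long) block.length * bits + (long) exceptions * EXCEPTION_BITS;
            // Strict comparison: on a tie the smaller bit width wins.
            if (best == null || cost < best.estimatedBits) {
                best = new Estimate(bits, exceptions, cost);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] block = new int[128];
        Arrays.fill(block, 5);   // 5 fits in 3 bits
        block[7] = 100_000;      // needs 17 bits
        block[99] = 1 << 20;     // needs 21 bits
        Estimate e = estimate(block);
        // prints "numFrameBits=3 numExceptions=2"
        System.out.println("numFrameBits=" + e.numFrameBits
            + " numExceptions=" + e.numExceptions);
    }
}
```

Under this cost model the two outliers are cheaper to store as exceptions than to widen
all 128 slots, so the scan settles on 3 bits. A single pass that builds a histogram of
bits-needed per value would give the same answer without rescanning, which may be one
way to reduce the indexing overhead.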
> Add a useful intblock postings format (eg, FOR, PFOR, PFORDelta, Simple9/16/64, etc.)
> -------------------------------------------------------------------------------------
>                 Key: LUCENE-3892
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>              Labels: gsoc2012, lucene-gsoc-12
>             Fix For: 4.1
>         Attachments: LUCENE-3892_pfor.patch, LUCENE-3892_pfor.patch, LUCENE-3892_settings.patch,
> On the flex branch we explored a number of possible intblock
> encodings, but for whatever reason never brought them to completion.
> There are still a number of issues opened with patches in different
> states.
> Initial results (based on prototype) were excellent (see
> ).
> I think this would make a good GSoC project.
