lucene-dev mailing list archives

From Benson Margulies <>
Subject Re: Anticipating a benchmark for direct posting format
Date Mon, 07 Apr 2014 11:45:10 GMT
If you look at
( should also work),
you'll see the results of a luceneutil run that compares DPF to
'normal' on the 10M wikipedia case. Some things are better, some are
worse, some are the same.

The claim here was never that DPF was some sort of universal solvent;
it was that for certain applications it made a material speedup, and
so it was worth some API complexity to liberate it from the codecs.
I'm going to assert here that these results support the claim well
enough to justify taking a run at the API, and then we'll see if I can
come up with something that people find tolerable in proportion to the

On Thu, Apr 3, 2014 at 12:27 PM, Benson Margulies <> wrote:
> On Thu, Apr 3, 2014 at 11:37 AM, Michael McCandless
> <> wrote:
>> Is the benchmark just trying to measure speedups by using DirectPF vs
>> the default PF?  You could do this today w/ luceneutil (using
>> Wikipedia as content).
>> But if you have another content source / index, I'm happy to run the
>> benchmark.  It'd be easier to make the content available (CSV, or line
>> docs file format), then ship around big indices ...
>> I have a box with 48 GB RAM.
>> Mike McCandless
> My takeaway from the prior conversation was that various people didn't
> entirely believe that I'd seen a dramatic improvement in query performance
> using D-P-F, and so would not smile upon a patch intended to liberate
> D-P-F from codecs. It could be that the effect I saw has to do with
> the fact that our system depends on hitting and scoring 50% of the
> documents in an index with a lot of documents.
> If you can help me try to simulate this situation with luceneutil, I'd
> be happy to skip the work I was about to do to build another
> benchmark.
>> On Thu, Apr 3, 2014 at 8:38 AM, Benson Margulies <> wrote:
>>> Some of you may recall that I started a thread some time ago about
>>> wishing for the benefits of the direct posting format without needing
>>> to use a codec. The thread landed as a challenge: show a benchmark of
>>> the benefit of D-P-F.
>>> After a lot of distraction, I'm now in a position to build it. The
>>> core is a rather large index, and to show the effect (always assuming
>>> that I succeed) will take a machine with a large amount of RAM.
>>> One approach is for me to simply build the index involved and make it
>>> available as an index. Another would be to side-step into a giant pile
>>> of CSV or JSON and provide a do-it-yourself kit.
>>> Anyone have a preference?
>>> What have we got for hardware with 40G of RAM? Anything, or will this
>>> be up to individuals to try out on dayjob hardware?
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
