lucene-dev mailing list archives

From robert engels
Subject Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
Date Mon, 11 Feb 2008 03:33:53 GMT
Please chill. You are inferring something that was not implied. You  
may think it lacks perspective and respect (I disagree on both), but  
it certainly doesn't lack in correctness.

First, depending on how you measure it, 2x speedup equates to a 50%  
reduction in time. In my review of the changes that brought about the  
biggest performance gains from 1.9 on, almost all were related to  
avoiding disk accesses by buffering more documents and doing more  
processing in memory.  I don't think many of the micro-benchmarks  
mattered much, and with a JVM environment it is very difficult to  
prove as it is going to be heavily JVM and configuration dependent.
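
The arithmetic behind these percentages can be checked with a short sketch. The class and method names below are illustrative only and are not part of Lucene; the timings are the best-of-10 figures quoted further down in this thread.

```java
public class SpeedupMath {
    /** Fraction of run time saved by a given speedup factor (2.0 gives 0.5). */
    public static double timeSavedFraction(double speedup) {
        return 1.0 - 1.0 / speedup;
    }

    /** Percentage by which the slower timing exceeds the faster one. */
    public static double percentSlower(double faster, double slower) {
        return (slower - faster) / faster * 100.0;
    }

    public static void main(String[] args) {
        // A 2x speedup halves the run time.
        System.out.printf("2x speedup saves %.0f%% of the time%n",
                timeSavedFraction(2.0) * 100.0);
        // Timings from the quoted benchmark: original patch vs simplified patch.
        System.out.printf("Mac OS X: %.1f%% slower%n", percentSlower(247.1, 254.9));
        System.out.printf("Windows:  %.1f%% slower%n", percentSlower(440.6, 452.7));
        // 6 seconds on an operation that takes 4 minutes (240 seconds).
        System.out.printf("6s of 240s: %.1f%%%n", 6.0 / 240.0 * 100.0);
    }
}
```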

The main point was that ANY disk access is going to cost ORDERS OF
MAGNITUDE more time than any of these sorts of optimizations save.

So either you are loading the index completely in memory (only small  
indexes, so the difference in speed is not going to matter much), or  
you might be using a federated system of memory indices (to form a  
large index), but USUALLY at some point the index must be first  
created in a persistent store (that which is covered here), in order  
to provide realistic restart times, etc.

The author of the patch and timings gives no information as to disk  
speed, IO speed, controllers, RAID configuration, etc. When creating  
an index in persistent store, these factors matter more than a 2-4%  
speed up.  Creating an index completely in memory is then bound by  
the reading of the data from the disk, and/or the network - all much  
slower than the actual indexing.
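
One way to make this argument concrete is Amdahl's law: if only a fraction p of the total run time is spent in the code being optimized, and that code becomes s times faster, the overall speedup is 1 / ((1 - p) + p / s). The sketch below is only an illustration. The 20% CPU share is a made-up assumption, not a measurement from this thread, and the class name is hypothetical.

```java
public class AmdahlSketch {
    /**
     * Overall speedup when a fraction p (0..1) of the run time
     * is made s times faster and the rest is left unchanged.
     */
    public static double overallSpeedup(double p, double s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    public static void main(String[] args) {
        // ASSUMPTION: hypothetical share of run time spent in the optimized code;
        // the remainder stands for disk and network I/O.
        double cpuShare = 0.20;
        // A hot path that is 3% faster corresponds to s = 1.03.
        double overall = overallSpeedup(cpuShare, 1.03);
        System.out.printf("overall speedup: %.4f%n", overall);
    }
}
```

The smaller the share of time spent in the optimized code, the closer the overall speedup gets to 1.0, whatever the local gain.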

Usually optimizations like this only matter in areas of development  
where the data set is small, but the processing large (a lot of  
numerical analysis).  In some cases the data set may also be "large",  
but then usually the processing is exponentially larger.  The  
building of the index in Lucene is not very computationally expensive.

If you are going to spend hundreds of hours "optimizing", you best be  
optimizing the right things. That was the point of the link I sent  
(the quotes are from people far more capable than I).

I was trying to make the point that a 2-4 % speed up probably doesn't  
amount to much in a real environment given all of the other factors,  
so it is probably better for the project/community to err on the side  
of code clarity and ease of maintenance.

The project can continue to do what it wants (obviously) - but what I
was pointing out should be nothing new to experienced
designers/developers - I am only offering a reminder. It is my
observation (others will disagree!) that a lot of Lucene has some
unneeded esoteric code, where the benefit doesn't match the cost.

On Feb 10, 2008, at 5:48 PM, Mike Klaas wrote:

> While I agree in general that excessive optimization at the expense  
> of code clarity is undesirable, you are overstating the point.  2X  
> is a ridiculous threshold to apply to something as performance  
> critical as a full text search engine.  If search was twice as  
> slow, Lucene would be utterly unusable for me.  Indexing is less  
> important than search, of course, but a 2X slowdown would be quite  
> painful there.
> I don't have an opinion in this case: I believe that there is a  
> tradeoff but that it is the responsibility of the committer(s) to  
> achieve the correct balance--they are the ones who will be  
> maintaining the code, after all.  I find your persistence  
> surprising and your tone dangerously near condescending.  Telling  
> the guy who has spent hundreds of hours carefully optimizing this  
> code that "Almost always there is a better bottleneck somewhere"  
> shows an astonishing lack of perspective and respect.
> -Mike
> On 10-Feb-08, at 12:15 PM, robert engels wrote:
>> I am not sure these numbers matter. I think they are skewed  
>> because you are probably running too short a test, and the index  
>> is in memory (or OS cache).
>> Once you use a real index that needs to read/write from the disk,  
>> the percentage change will be negligible.
>> This is the problem with many of these "performance changes" -  
>> they just aren't real world enough.  Even if they were, I would  
>> argue that code simplicity/maintainability is worth more than 6  
>> seconds on an operation that takes 4 minutes to run...
>> There are many people that believe micro benchmarks are next to  
>> worthless. A good rule of thumb is that if the optimization  
>> doesn't result in 2x speedup, it probably shouldn't be done. In  
>> most cases any efficiency gains are later lost in maintainability  
>> issues.
>> See
>> Almost always there is a better bottleneck somewhere.
>> On Feb 10, 2008, at 1:37 PM, Michael McCandless wrote:
>>> Yonik Seeley wrote:
>>>> I wonder how well a single generic quickSort(Object[] arr, int low,
>>>> int high) would perform vs the type-specific ones?  I guess the main
>>>> overhead would be a cast from Object to the specific class to do the
>>>> compare?  Too bad Java doesn't have true generics/templates.
>>> OK I tested this.
>>> Starting from the patch on LUCENE-1172, which has 3 quickSort methods
>>> (one per type), I created a single quickSort method on Object[] that
>>> takes a Comparator, and made 3 Comparators instead.
>>> Mac OS X 10.4 (JVM 1.5):
>>>     original patch --> 247.1
>>>   simplified patch --> 254.9 (3.2% slower)
>>> Windows Server 2003 R64 (JVM 1.6):
>>>     original patch --> 440.6
>>>   simplified patch --> 452.7 (2.7% slower)
>>> The times are the best of 10 runs.  I'm running all tests with these
>>> JVM args:
>>>   -Xms1024M -Xmx1024M -Xbatch -server
>>> I think this is a big enough difference in performance that it's
>>> worth keeping 3 separate quickSorts in DocumentsWriter.
>>> Mike
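
The two variants compared above can be sketched roughly as follows. This is not the code from the LUCENE-1172 patch: the `Entry` class, the method names, and the partitioning scheme are assumptions chosen for illustration. The only point is that the generic version makes every comparison through a `Comparator` call, while the type-specific version compares fields inline.

```java
import java.util.Comparator;

public class QuickSortSketch {

    /** Hypothetical element type; stands in for whatever the real code sorts. */
    public static final class Entry {
        public final int key;
        public Entry(int key) { this.key = key; }
    }

    /** Generic variant: one method for all types, each compare goes through cmp. */
    public static <T> void quickSortGeneric(T[] arr, int low, int high,
                                            Comparator<? super T> cmp) {
        if (low >= high) return;
        T pivot = arr[low + (high - low) / 2];
        int i = low, j = high;
        while (i <= j) {
            while (cmp.compare(arr[i], pivot) < 0) i++;
            while (cmp.compare(arr[j], pivot) > 0) j--;
            if (i <= j) {
                T tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
                i++; j--;
            }
        }
        quickSortGeneric(arr, low, j, cmp);
        quickSortGeneric(arr, i, high, cmp);
    }

    /** Type-specific variant: same algorithm, comparison inlined on Entry.key. */
    public static void quickSortEntries(Entry[] arr, int low, int high) {
        if (low >= high) return;
        int pivot = arr[low + (high - low) / 2].key;
        int i = low, j = high;
        while (i <= j) {
            while (arr[i].key < pivot) i++;
            while (arr[j].key > pivot) j--;
            if (i <= j) {
                Entry tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
                i++; j--;
            }
        }
        quickSortEntries(arr, low, j);
        quickSortEntries(arr, i, high);
    }

    public static void main(String[] args) {
        Entry[] a = { new Entry(4), new Entry(2), new Entry(8), new Entry(1) };
        Entry[] b = a.clone();
        Comparator<Entry> byKey = (x, y) -> Integer.compare(x.key, y.key);
        quickSortGeneric(a, 0, a.length - 1, byKey);
        quickSortEntries(b, 0, b.length - 1);
        for (int k = 0; k < a.length; k++) {
            System.out.println(a[k].key + " " + b[k].key);
        }
    }
}
```

Whether the extra `Comparator` call costs anything measurable depends on the JVM and on how well it inlines the call, which is why it has to be timed rather than assumed.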
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail: