lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From robert engels <reng...@ix.netcom.com>
Subject Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
Date Mon, 11 Feb 2008 15:35:01 GMT
Also, these couple of paging have some very good information on  
sorting, and why heapsort is even faster than quicksort...

http://users.aims.ac.za/~mackay/sorting/sorting.html
http://www.azillionmonkeys.com/qed/sort.html


On Feb 11, 2008, at 9:29 AM, robert engels wrote:

> My intent was not to diminish your hard work. We all appreciate it.  
> I was only trying to caution that 4% gains are not all what they  
> seem to be.
>
> If you looks at Arrays.java in the 1.5 JDK, and read through the  
> javadoc, you will quickly see that the sorting is well-thought out.
>
> They use a tuned quicksort for primitives, which offers O(n(log(n))  
> performance, and a modified mergesort for Objects guaranteeing O(n 
> (log(n)) performance. A standard quicksort has worst case  
> performance of O(n^2) ! Both use an insertion sort if the numbers  
> of elements is small.
>
> I can only assume that in their testing they chose a mergesort for  
> objects was to either: 1. have stable sort times, or more likely 2.  
> the merge sort has a better chance of being optimized by the JIT,  
> and/or the sequential access of elements makes for more efficient  
> object access in the JVM.
>
> These people that are far more capable than me chose one over the  
> other for I assume very good reasons - I just wish I knew what they  
> were.
>
> On Feb 11, 2008, at 5:14 AM, Michael McCandless wrote:
>
>> In fact I've found you need to pursue both the 2x type gains and also
>> the many smaller ones, to reach good performance.  And it requires
>> alot of ongoing vigilence to keep good performance.  You lose 3-4%
>> here and there and very quickly, very easily you're 2X slower.
>>
>> These tests are very real. I'm indexing Wikipedia content, using
>> StandardAnalyzer, running under contrib/benchmark.  It's true, in a
>> real app more time will be spent pulling documents from the source,
>> but I'm intentionally trying to minimize that in order to measure  
>> just
>> the indexing time.  Getting a 4% gain by replacing mergesort with
>> quicksort is real.
>>
>> If the profiler found other 4% gains, with such a small increase in
>> code complexity, I would passionately argue for those as well.  So
>> far it hasn't.
>>
>> Robert if you have some concrete ideas for the 2X type gains, I'm all
>> ears :)
>>
>> I certainly agree there is a point where complexity cost doesn't
>> offset the performance gain, but I think this particular change is
>> well before that point.
>>
>> Lucene's indexing throughput is an important metric in its
>> competitiveness with other search engines.  And I want Lucene to be
>> the best.
>>
>> Mike
>>
>> eks dev wrote:
>>
>>> again, as long as you do not make one step forward into actual  
>>> code, we will continue to have  what we have today, as this is  
>>> the best what we have.
>>>
>>> you made your statement:
>>> "Clear code will allow for more radical improvements as more eyes  
>>> will be able to easily understand the inner workings and offer  
>>> better algorithms",
>>>
>>> Not a single person here would ever dispute this statement, but  
>>> unfortunately there is no compiler that executes such statements.  
>>> Make a patch that utilizes this "clear-code" paradigm, show us   
>>> these better algorithms on actual example  and than say: "without  
>>> LUCENE-1172 I was able to improve XYZ feature by using ABC  
>>> algorithm". That would work smooth.
>>>
>>> Anyhow, I am not going to write more on this topic, sorry for the  
>>> noise...
>>>
>>> And Robert, please do not get this wrong, I see your point and I  
>>> respect it! I just felt slight unfairness to the people that make  
>>> the hands dirty writing as clear and fast code as possible.
>>>
>>>
>>>
>>>
>>> ----- Original Message ----
>>> From: robert engels <rengels@ix.netcom.com>
>>> To: java-dev@lucene.apache.org
>>> Sent: Monday, 11 February, 2008 9:55:02 AM
>>> Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to  
>>> DocumentsWriter
>>>
>>> I am not disputing that there is a speed improvement. I am disputing
>>> that the performance gain of many of these patches is not worth the
>>> additional complexity in the code. Clear code will allow for more
>>> radical improvements as more eyes will be able to easily understand
>>> the inner workings and offer better algorithms, not just micro
>>> improvements that the JVM (eventually) can probably figure out on  
>>> its
>>> own.
>>>
>>> It is a value judgement, and regretfully I don't have another 30
>>> years to pass down the full knowledge behind my reasoning.
>>>
>>> Luckily, however, there are some very good books available on the
>>> subject...
>>>
>>> It's not the fault of the submitter, but many of these timings are
>>> suspect due to difficulty in measuring the improvements accurately.
>>>
>>> Here is a simple example:
>>>
>>> You can configure the JVM to not perform aggressive garbage
>>> collection, and write a program that generates a lot garbage -  
>>> but it
>>> runs very fast (not GCing), until the GC eventually occurs (if the
>>> program runs long enough). It may be overall much slower than an
>>> alternative that runs slower as it executes, but has code to manage
>>> the objects as they are created, and rarely if ever hits a GC cycle.
>>> But then, the JVM (e.g. generational GC) can implement improvements
>>> that makes choice A faster (and the better choice)... and the cycle
>>> continues...
>>>
>>> Without detailed timings and other metrics (GC pauses, IO, memory
>>> utilization, native compilation, etc.) most benchmarks are not very
>>> accurate or useful.  There are a lot of variables to consider -  
>>> maybe
>>> more so than can reasonably be considered.  That is why a 4% gain is
>>> highly suspect.  If the gain was 25%, or 50% or 100%, you have a
>>> better chance of it being an innate improvement, and not just the
>>> interaction of some other factors.
>>>
>>> On Feb 11, 2008, at 2:32 AM, eks dev wrote:
>>>
>>>> Robert,
>>>>
>>>> you may or may not be right, I do not know. The only way to prove
>>>> it would be to show you can do it better, no?
>>>> If you are so convinced this is wrong, you could, much better than
>>>> quoting textbooks:
>>>>
>>>> a) write better patch, get attention with something you think is
>>>> "better bottleneck"
>>>> b) provide realistic "performance tests" as you dispute the
>>>> measurement provided here
>>>>
>>>> It has to be that concrete, academic discussions are cool, but at
>>>> the end of a day, it is the code that executes that counts.
>>>>
>>>> cheers,
>>>> eks
>>>>
>>>> ----- Original Message ----
>>>> From: robert engels <rengels@ix.netcom.com>
>>>> To: java-dev@lucene.apache.org
>>>> Sent: Sunday, 10 February, 2008 9:15:30 PM
>>>> Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to
>>>> DocumentsWriter
>>>>
>>>> I am not sure these numbers matter. I think they are skewed because
>>>> you are probably running too short a test, and the index is in  
>>>> memory
>>>> (or OS cache).
>>>>
>>>> Once you use a real index that needs to read/write from the  
>>>> disk, the
>>>> percentage change will be negligible.
>>>>
>>>> This is the problem with many of these "performance changes" - they
>>>> just aren't real world enough.  Even if they were, I would argue  
>>>> that
>>>> code simplicity/maintainability is worth more than 6 seconds on a
>>>> operation that takes 4 minutes to run...
>>>>
>>>> There are many people that believe micro benchmarks are next to
>>>> worthless. A good rule of thumb is that if the optimization doesn't
>>>> result in 2x speedup, it probably shouldn't be done. In most cases
>>>> any efficiency gains are later lost in maintainability issues.
>>>>
>>>> See http://en.wikipedia.org/wiki/Optimization_(computer_science)
>>>>
>>>> Almost always there is a better bottleneck somewhere.
>>>>
>>>> On Feb 10, 2008, at 1:37 PM, Michael McCandless wrote:
>>>>
>>>>>
>>>>> Yonik Seeley wrote:
>>>>>
>>>>>> I wonder how well a single generic quickSort(Object[] arr, int  
>>>>>> low,
>>>>>> int high) would perform vs the type-specific ones?  I guess  
>>>>>> the main
>>>>>> overhead would be a cast from Object to the specific class to  
>>>>>> do the
>>>>>> compare?  Too bad Java doesn't have true generics/templates.
>>>>>
>>>>>
>>>>> OK I tested this.
>>>>>
>>>>> Starting from the patch on LUCENE-1172, which has 3 quickSort  
>>>>> methods
>>>>> (one per type), I created a single quickSort method on Object[]  
>>>>> that
>>>>> takes a Comparator, and made 3 Comparators instead.
>>>>>
>>>>> Mac OS X 10.4 (JVM 1.5):
>>>>>
>>>>>     original patch --> 247.1
>>>>>   simplified patch --> 254.9 (3.2% slower)
>>>>>
>>>>> Windows Server 2003 R64 (JVM 1.6):
>>>>>
>>>>>     original patch --> 440.6
>>>>>   simplified patch --> 452.7 (2.7% slower)
>>>>>
>>>>> The times are best in 10 runs.  I'm running all tests with  
>>>>> these JVM
>>>>> args:
>>>>>
>>>>>   -Xms1024M -Xmx1024M -Xbatch -server
>>>>>
>>>>> I think this is a big enough difference in performance that it's
>>>>> worth keeping 3 separate quickSorts in DocumentsWriter.
>>>>>
>>>>> Mike
>>>>>
>>>>> ------------------------------------------------------------------ 
>>>>> ---
>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------- 
>>>> --
>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>       __________________________________________________________
>>>> Sent from Yahoo! Mail - a smarter inbox http://uk.mail.yahoo.com
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------- 
>>>> --
>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene..apache.org
>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>
>>>
>>>
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>>
>>>
>>>       __________________________________________________________
>>> Sent from Yahoo! Mail - a smarter inbox http://uk.mail.yahoo.com
>>>
>>>
>>>
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message