lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From robert engels <reng...@ix.netcom.com>
Subject Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
Date Mon, 11 Feb 2008 16:02:12 GMT
One final thing, the guys responsible for the sorting in Arrays.java  
- Joshua Bloch and Neal Gafter.

Now I KNOW there must be a very good reason for the choices they made...

On Feb 11, 2008, at 9:35 AM, robert engels wrote:

> Also, these couple of paging have some very good information on  
> sorting, and why heapsort is even faster than quicksort...
>
> http://users.aims.ac.za/~mackay/sorting/sorting.html
> http://www.azillionmonkeys.com/qed/sort.html
>
>
> On Feb 11, 2008, at 9:29 AM, robert engels wrote:
>
>> My intent was not to diminish your hard work. We all appreciate  
>> it. I was only trying to caution that 4% gains are not all what  
>> they seem to be.
>>
>> If you looks at Arrays.java in the 1.5 JDK, and read through the  
>> javadoc, you will quickly see that the sorting is well-thought out.
>>
>> They use a tuned quicksort for primitives, which offers O(n(log 
>> (n)) performance, and a modified mergesort for Objects  
>> guaranteeing O(n(log(n)) performance. A standard quicksort has  
>> worst case performance of O(n^2) ! Both use an insertion sort if  
>> the numbers of elements is small.
>>
>> I can only assume that in their testing they chose a mergesort for  
>> objects was to either: 1. have stable sort times, or more likely  
>> 2. the merge sort has a better chance of being optimized by the  
>> JIT, and/or the sequential access of elements makes for more  
>> efficient object access in the JVM.
>>
>> These people that are far more capable than me chose one over the  
>> other for I assume very good reasons - I just wish I knew what  
>> they were.
>>
>> On Feb 11, 2008, at 5:14 AM, Michael McCandless wrote:
>>
>>> In fact I've found you need to pursue both the 2x type gains and  
>>> also
>>> the many smaller ones, to reach good performance.  And it requires
>>> alot of ongoing vigilence to keep good performance.  You lose 3-4%
>>> here and there and very quickly, very easily you're 2X slower.
>>>
>>> These tests are very real. I'm indexing Wikipedia content, using
>>> StandardAnalyzer, running under contrib/benchmark.  It's true, in a
>>> real app more time will be spent pulling documents from the source,
>>> but I'm intentionally trying to minimize that in order to measure  
>>> just
>>> the indexing time.  Getting a 4% gain by replacing mergesort with
>>> quicksort is real.
>>>
>>> If the profiler found other 4% gains, with such a small increase in
>>> code complexity, I would passionately argue for those as well.  So
>>> far it hasn't.
>>>
>>> Robert if you have some concrete ideas for the 2X type gains, I'm  
>>> all
>>> ears :)
>>>
>>> I certainly agree there is a point where complexity cost doesn't
>>> offset the performance gain, but I think this particular change is
>>> well before that point.
>>>
>>> Lucene's indexing throughput is an important metric in its
>>> competitiveness with other search engines.  And I want Lucene to be
>>> the best.
>>>
>>> Mike
>>>
>>> eks dev wrote:
>>>
>>>> again, as long as you do not make one step forward into actual  
>>>> code, we will continue to have  what we have today, as this is  
>>>> the best what we have.
>>>>
>>>> you made your statement:
>>>> "Clear code will allow for more radical improvements as more  
>>>> eyes will be able to easily understand the inner workings and  
>>>> offer better algorithms",
>>>>
>>>> Not a single person here would ever dispute this statement, but  
>>>> unfortunately there is no compiler that executes such  
>>>> statements. Make a patch that utilizes this "clear-code"  
>>>> paradigm, show us  these better algorithms on actual example   
>>>> and than say: "without LUCENE-1172 I was able to improve XYZ  
>>>> feature by using ABC algorithm". That would work smooth.
>>>>
>>>> Anyhow, I am not going to write more on this topic, sorry for  
>>>> the noise...
>>>>
>>>> And Robert, please do not get this wrong, I see your point and I  
>>>> respect it! I just felt slight unfairness to the people that  
>>>> make the hands dirty writing as clear and fast code as possible.
>>>>
>>>>
>>>>
>>>>
>>>> ----- Original Message ----
>>>> From: robert engels <rengels@ix.netcom.com>
>>>> To: java-dev@lucene.apache.org
>>>> Sent: Monday, 11 February, 2008 9:55:02 AM
>>>> Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to  
>>>> DocumentsWriter
>>>>
>>>> I am not disputing that there is a speed improvement. I am  
>>>> disputing
>>>> that the performance gain of many of these patches is not worth the
>>>> additional complexity in the code. Clear code will allow for more
>>>> radical improvements as more eyes will be able to easily understand
>>>> the inner workings and offer better algorithms, not just micro
>>>> improvements that the JVM (eventually) can probably figure out  
>>>> on its
>>>> own.
>>>>
>>>> It is a value judgement, and regretfully I don't have another 30
>>>> years to pass down the full knowledge behind my reasoning.
>>>>
>>>> Luckily, however, there are some very good books available on the
>>>> subject...
>>>>
>>>> It's not the fault of the submitter, but many of these timings are
>>>> suspect due to difficulty in measuring the improvements accurately.
>>>>
>>>> Here is a simple example:
>>>>
>>>> You can configure the JVM to not perform aggressive garbage
>>>> collection, and write a program that generates a lot garbage -  
>>>> but it
>>>> runs very fast (not GCing), until the GC eventually occurs (if the
>>>> program runs long enough). It may be overall much slower than an
>>>> alternative that runs slower as it executes, but has code to manage
>>>> the objects as they are created, and rarely if ever hits a GC  
>>>> cycle.
>>>> But then, the JVM (e.g. generational GC) can implement improvements
>>>> that makes choice A faster (and the better choice)... and the cycle
>>>> continues...
>>>>
>>>> Without detailed timings and other metrics (GC pauses, IO, memory
>>>> utilization, native compilation, etc.) most benchmarks are not very
>>>> accurate or useful.  There are a lot of variables to consider -  
>>>> maybe
>>>> more so than can reasonably be considered.  That is why a 4%  
>>>> gain is
>>>> highly suspect.  If the gain was 25%, or 50% or 100%, you have a
>>>> better chance of it being an innate improvement, and not just the
>>>> interaction of some other factors.
>>>>
>>>> On Feb 11, 2008, at 2:32 AM, eks dev wrote:
>>>>
>>>>> Robert,
>>>>>
>>>>> you may or may not be right, I do not know. The only way to prove
>>>>> it would be to show you can do it better, no?
>>>>> If you are so convinced this is wrong, you could, much better than
>>>>> quoting textbooks:
>>>>>
>>>>> a) write better patch, get attention with something you think is
>>>>> "better bottleneck"
>>>>> b) provide realistic "performance tests" as you dispute the
>>>>> measurement provided here
>>>>>
>>>>> It has to be that concrete, academic discussions are cool, but at
>>>>> the end of a day, it is the code that executes that counts.
>>>>>
>>>>> cheers,
>>>>> eks
>>>>>
>>>>> ----- Original Message ----
>>>>> From: robert engels <rengels@ix.netcom.com>
>>>>> To: java-dev@lucene.apache.org
>>>>> Sent: Sunday, 10 February, 2008 9:15:30 PM
>>>>> Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to
>>>>> DocumentsWriter
>>>>>
>>>>> I am not sure these numbers matter. I think they are skewed  
>>>>> because
>>>>> you are probably running too short a test, and the index is in  
>>>>> memory
>>>>> (or OS cache).
>>>>>
>>>>> Once you use a real index that needs to read/write from the  
>>>>> disk, the
>>>>> percentage change will be negligible.
>>>>>
>>>>> This is the problem with many of these "performance changes" -  
>>>>> they
>>>>> just aren't real world enough.  Even if they were, I would  
>>>>> argue that
>>>>> code simplicity/maintainability is worth more than 6 seconds on a
>>>>> operation that takes 4 minutes to run...
>>>>>
>>>>> There are many people that believe micro benchmarks are next to
>>>>> worthless. A good rule of thumb is that if the optimization  
>>>>> doesn't
>>>>> result in 2x speedup, it probably shouldn't be done. In most cases
>>>>> any efficiency gains are later lost in maintainability issues.
>>>>>
>>>>> See http://en.wikipedia.org/wiki/Optimization_(computer_science)
>>>>>
>>>>> Almost always there is a better bottleneck somewhere.
>>>>>
>>>>> On Feb 10, 2008, at 1:37 PM, Michael McCandless wrote:
>>>>>
>>>>>>
>>>>>> Yonik Seeley wrote:
>>>>>>
>>>>>>> I wonder how well a single generic quickSort(Object[] arr,  
>>>>>>> int low,
>>>>>>> int high) would perform vs the type-specific ones?  I guess 

>>>>>>> the main
>>>>>>> overhead would be a cast from Object to the specific class to
 
>>>>>>> do the
>>>>>>> compare?  Too bad Java doesn't have true generics/templates.
>>>>>>
>>>>>>
>>>>>> OK I tested this.
>>>>>>
>>>>>> Starting from the patch on LUCENE-1172, which has 3 quickSort  
>>>>>> methods
>>>>>> (one per type), I created a single quickSort method on Object 
>>>>>> [] that
>>>>>> takes a Comparator, and made 3 Comparators instead.
>>>>>>
>>>>>> Mac OS X 10.4 (JVM 1.5):
>>>>>>
>>>>>>     original patch --> 247.1
>>>>>>   simplified patch --> 254.9 (3.2% slower)
>>>>>>
>>>>>> Windows Server 2003 R64 (JVM 1.6):
>>>>>>
>>>>>>     original patch --> 440.6
>>>>>>   simplified patch --> 452.7 (2.7% slower)
>>>>>>
>>>>>> The times are best in 10 runs.  I'm running all tests with  
>>>>>> these JVM
>>>>>> args:
>>>>>>
>>>>>>   -Xms1024M -Xmx1024M -Xbatch -server
>>>>>>
>>>>>> I think this is a big enough difference in performance that it's
>>>>>> worth keeping 3 separate quickSorts in DocumentsWriter.
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> -----------------------------------------------------------------

>>>>>> ----
>>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------ 
>>>>> ---
>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>       __________________________________________________________
>>>>> Sent from Yahoo! Mail - a smarter inbox http://uk.mail.yahoo.com
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------ 
>>>>> ---
>>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene..apache.org
>>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------- 
>>>> --
>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>       __________________________________________________________
>>>> Sent from Yahoo! Mail - a smarter inbox http://uk.mail.yahoo.com
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------- 
>>>> --
>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>
>>>
>>>
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message