lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
Date Mon, 11 Feb 2008 19:00:19 GMT
OK, I am convinced that this one is useful.  Also, perhaps we should  
spin off another thread to discuss how to make DocsWriter easier to  
maintain.  My biggest concern is understanding how the various threads  
work together, and a few other areas but, like I said, let's spin up a  
separate thread to brainstorm what is needed.

Note, that there is some risk in just using wikipedia for profiling  
given it's distribution of terms, etc..  I also wonder if using the  
LineDocMaker is all that realistic a profiling scenario.  While it is  
really useful in that it minimizes IO interaction, etc. I can't help  
but feel that it isn't at all close to typical usage.  Most users are  
not going to have all their docs rolled up into a single file, 1 doc  
per line, so I wonder if we potentially lose insight into how Lucene  
performs given that other issues like I/O/memory used for loading  
files may force the JVM/Lucene to not have the resources it needs.  Of  
course, I do know it is good to try to isolate things so we can focus  
just on Lucene, but we also should try to make some accounting for how  
it lives in the wild.

Last, I think it would be good to always attach/check in the .alg file  
that is used when running the test, so that others can verify on  
different systems/configurations, etc.

-Grant


On Feb 11, 2008, at 6:14 AM, Michael McCandless wrote:

> In fact I've found you need to pursue both the 2x type gains and also
> the many smaller ones, to reach good performance.  And it requires
> alot of ongoing vigilence to keep good performance.  You lose 3-4%
> here and there and very quickly, very easily you're 2X slower.
>
> These tests are very real. I'm indexing Wikipedia content, using
> StandardAnalyzer, running under contrib/benchmark.  It's true, in a
> real app more time will be spent pulling documents from the source,
> but I'm intentionally trying to minimize that in order to measure just
> the indexing time.  Getting a 4% gain by replacing mergesort with
> quicksort is real.
>
> If the profiler found other 4% gains, with such a small increase in
> code complexity, I would passionately argue for those as well.  So
> far it hasn't.
>
> Robert if you have some concrete ideas for the 2X type gains, I'm all
> ears :)
>
> I certainly agree there is a point where complexity cost doesn't
> offset the performance gain, but I think this particular change is
> well before that point.
>
> Lucene's indexing throughput is an important metric in its
> competitiveness with other search engines.  And I want Lucene to be
> the best.
>
> Mike
>
> eks dev wrote:
>
>> again, as long as you do not make one step forward into actual  
>> code, we will continue to have  what we have today, as this is the  
>> best what we have.
>>
>> you made your statement:
>> "Clear code will allow for more radical improvements as more eyes  
>> will be able to easily understand the inner workings and offer  
>> better algorithms",
>>
>> Not a single person here would ever dispute this statement, but  
>> unfortunately there is no compiler that executes such statements.  
>> Make a patch that utilizes this "clear-code" paradigm, show us   
>> these better algorithms on actual example  and than say: "without  
>> LUCENE-1172 I was able to improve XYZ feature by using ABC  
>> algorithm". That would work smooth.
>>
>> Anyhow, I am not going to write more on this topic, sorry for the  
>> noise...
>>
>> And Robert, please do not get this wrong, I see your point and I  
>> respect it! I just felt slight unfairness to the people that make  
>> the hands dirty writing as clear and fast code as possible.
>>
>>
>>
>>
>> ----- Original Message ----
>> From: robert engels <rengels@ix.netcom.com>
>> To: java-dev@lucene.apache.org
>> Sent: Monday, 11 February, 2008 9:55:02 AM
>> Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to  
>> DocumentsWriter
>>
>> I am not disputing that there is a speed improvement. I am disputing
>> that the performance gain of many of these patches is not worth the
>> additional complexity in the code. Clear code will allow for more
>> radical improvements as more eyes will be able to easily understand
>> the inner workings and offer better algorithms, not just micro
>> improvements that the JVM (eventually) can probably figure out on its
>> own.
>>
>> It is a value judgement, and regretfully I don't have another 30
>> years to pass down the full knowledge behind my reasoning.
>>
>> Luckily, however, there are some very good books available on the
>> subject...
>>
>> It's not the fault of the submitter, but many of these timings are
>> suspect due to difficulty in measuring the improvements accurately.
>>
>> Here is a simple example:
>>
>> You can configure the JVM to not perform aggressive garbage
>> collection, and write a program that generates a lot garbage - but it
>> runs very fast (not GCing), until the GC eventually occurs (if the
>> program runs long enough). It may be overall much slower than an
>> alternative that runs slower as it executes, but has code to manage
>> the objects as they are created, and rarely if ever hits a GC cycle.
>> But then, the JVM (e.g. generational GC) can implement improvements
>> that makes choice A faster (and the better choice)... and the cycle
>> continues...
>>
>> Without detailed timings and other metrics (GC pauses, IO, memory
>> utilization, native compilation, etc.) most benchmarks are not very
>> accurate or useful.  There are a lot of variables to consider - maybe
>> more so than can reasonably be considered.  That is why a 4% gain is
>> highly suspect.  If the gain was 25%, or 50% or 100%, you have a
>> better chance of it being an innate improvement, and not just the
>> interaction of some other factors.
>>
>> On Feb 11, 2008, at 2:32 AM, eks dev wrote:
>>
>>> Robert,
>>>
>>> you may or may not be right, I do not know. The only way to prove
>>> it would be to show you can do it better, no?
>>> If you are so convinced this is wrong, you could, much better than
>>> quoting textbooks:
>>>
>>> a) write better patch, get attention with something you think is
>>> "better bottleneck"
>>> b) provide realistic "performance tests" as you dispute the
>>> measurement provided here
>>>
>>> It has to be that concrete, academic discussions are cool, but at
>>> the end of a day, it is the code that executes that counts.
>>>
>>> cheers,
>>> eks
>>>
>>> ----- Original Message ----
>>> From: robert engels <rengels@ix.netcom.com>
>>> To: java-dev@lucene.apache.org
>>> Sent: Sunday, 10 February, 2008 9:15:30 PM
>>> Subject: Re: [jira] Created: (LUCENE-1172) Small speedups to
>>> DocumentsWriter
>>>
>>> I am not sure these numbers matter. I think they are skewed because
>>> you are probably running too short a test, and the index is in  
>>> memory
>>> (or OS cache).
>>>
>>> Once you use a real index that needs to read/write from the disk,  
>>> the
>>> percentage change will be negligible.
>>>
>>> This is the problem with many of these "performance changes" - they
>>> just aren't real world enough.  Even if they were, I would argue  
>>> that
>>> code simplicity/maintainability is worth more than 6 seconds on a
>>> operation that takes 4 minutes to run...
>>>
>>> There are many people that believe micro benchmarks are next to
>>> worthless. A good rule of thumb is that if the optimization doesn't
>>> result in 2x speedup, it probably shouldn't be done. In most cases
>>> any efficiency gains are later lost in maintainability issues.
>>>
>>> See http://en.wikipedia.org/wiki/Optimization_(computer_science)
>>>
>>> Almost always there is a better bottleneck somewhere.
>>>
>>> On Feb 10, 2008, at 1:37 PM, Michael McCandless wrote:
>>>
>>>>
>>>> Yonik Seeley wrote:
>>>>
>>>>> I wonder how well a single generic quickSort(Object[] arr, int  
>>>>> low,
>>>>> int high) would perform vs the type-specific ones?  I guess the  
>>>>> main
>>>>> overhead would be a cast from Object to the specific class to do  
>>>>> the
>>>>> compare?  Too bad Java doesn't have true generics/templates.
>>>>
>>>>
>>>> OK I tested this.
>>>>
>>>> Starting from the patch on LUCENE-1172, which has 3 quickSort  
>>>> methods
>>>> (one per type), I created a single quickSort method on Object[]  
>>>> that
>>>> takes a Comparator, and made 3 Comparators instead.
>>>>
>>>> Mac OS X 10.4 (JVM 1.5):
>>>>
>>>>    original patch --> 247.1
>>>>  simplified patch --> 254.9 (3.2% slower)
>>>>
>>>> Windows Server 2003 R64 (JVM 1.6):
>>>>
>>>>    original patch --> 440.6
>>>>  simplified patch --> 452.7 (2.7% slower)
>>>>
>>>> The times are best in 10 runs.  I'm running all tests with these  
>>>> JVM
>>>> args:
>>>>
>>>>  -Xms1024M -Xmx1024M -Xbatch -server
>>>>
>>>> I think this is a big enough difference in performance that it's
>>>> worth keeping 3 separate quickSorts in DocumentsWriter.
>>>>
>>>> Mike
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>>
>>>
>>>
>>>
>>>
>>>      __________________________________________________________
>>> Sent from Yahoo! Mail - a smarter inbox http://uk.mail.yahoo.com
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene..apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>>
>>
>>
>>
>>      __________________________________________________________
>> Sent from Yahoo! Mail - a smarter inbox http://uk.mail.yahoo.com
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message