lucene-dev mailing list archives

From Michael McCandless <>
Subject Re: [jira] Created: (LUCENE-1172) Small speedups to DocumentsWriter
Date Mon, 11 Feb 2008 19:30:41 GMT

Grant Ingersoll wrote:

> Also, perhaps we should spin off another thread to discuss how to  
> make DocsWriter easier to maintain.  My biggest concern is  
> understanding how the various threads work together, and a few  
> other areas but, like I said, let's spin up a separate thread to  
> brainstorm what is needed.

I agree we should work on simplifying it over time, and on spreading
the knowledge of how it works.

> Note that there is some risk in just using Wikipedia for profiling
> given its distribution of terms, etc.

Good point.  Previously I was using Europarl, but, that corpus is  
just too fast to index.

Are you thinking Wikipedia is somewhat "dirty" (lots of extra terms  
not normally seen with clean content)?  Since I'm using  
StandardAnalyzer and not an analyzer based on the new  
WikipediaTokenizer, I'm getting even more extra terms.  Also, I think we'd
need an HTMLFilter in the chain since Wikipedia content uses HTML  
markup.  Grant, what analyzer chain do you use when you index Wikipedia?

> I also wonder if using the LineDocMaker is all that realistic a  
> profiling scenario.  While it is really useful in that it minimizes  
> IO interaction, etc. I can't help but feel that it isn't at all  
> close to typical usage.  Most users are not going to have all their  
> docs rolled up into a single file, 1 doc per line, so I wonder if  
> we potentially lose insight into how Lucene performs given that  
> other issues like I/O/memory used for loading files may force the  
> JVM/Lucene to not have the resources it needs.  Of course, I do  
> know it is good to try to isolate things so we can focus just on  
> Lucene, but we also should try to make some accounting for how it  
> lives in the wild.

I agree, this part is not realistic, and the intention is to measure  
just the indexing time.  In fact I expect most apps spend quite a bit  
more time building up a Document (filtering binary docs, etc) than  
actually indexing it.  The only real-world app that I can think of  
that would be close to LineDocMaker is using Lucene to search big log  
files, where one line = one Document.
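
To make the "one line = one Document" case concrete, here is a rough sketch using only the JDK. It is not LineDocMaker's actual code and does not call any Lucene API; LineDoc and readDocs are hypothetical names, and LineDoc merely stands in for a Lucene Document with a single body field.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LineDocs {

    /** Hypothetical stand-in for a Lucene Document holding one log line. */
    static final class LineDoc {
        final int lineNumber;
        final String body;

        LineDoc(int lineNumber, String body) {
            this.lineNumber = lineNumber;
            this.body = body;
        }
    }

    /** Reads the input line by line and turns every non-empty line into one document. */
    static List<LineDoc> readDocs(Reader in) throws IOException {
        List<LineDoc> docs = new ArrayList<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            if (line.isEmpty()) {
                continue; // blank lines carry nothing to index
            }
            docs.add(new LineDoc(lineNumber, line));
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        // Made-up log lines, for illustration only.
        String log = "GET /index.html 200\n\nGET /missing 404\n";
        for (LineDoc doc : readDocs(new StringReader(log))) {
            System.out.println(doc.lineNumber + ": " + doc.body);
        }
    }
}
```

Building the document here is a single string copy, which is why this setup measures mostly indexing time; an app that has to filter or convert binary content first would spend much more of its time before the document ever reaches the index.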

> Last, I think it would be good to always attach/check in the .alg  
> file that is used when running the test, so that others can verify  
> on different systems/configurations, etc.

I did post the alg (under LUCENE-1172).  Though I see I forgot to
{code} it and it looks messed up now.  My recent test of a single
quickSort(Object[]) used the same alg, just repeated 10 times instead
of 3.

But I agree we should always post the alg for all tests...

