lucene-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Performance issues with ConjunctionScorer
Date Tue, 22 Nov 2005 15:17:40 GMT
Andrzej Bialecki wrote:

> Hi,
>
> I've been profiling a Nutch installation, and to my surprise the 
> largest amount of throwaway allocations and the most time spent was 
> not in Nutch specific code, or IPC, but in Lucene 
> ConjunctionScorer.doNext() method. This method operates on a 
> LinkedList, which seems to be a huge bottleneck. Perhaps it would be 
> possible to replace LinkedList with a table?
>
> Nutch Summarizer also needlessly re-tokenizes the text over and over 
> again - perhaps it would be better to save already tokenized text in 
> parse_text, instead of the raw plain text? After all, the only use for 
> that text is to index it and then build the summaries.
>
> Please see the profiles here:
>
>    http://www.getopt.org/nutch/profile/index.html
>
Further input into this: after replacing the ConjunctionScorer with the 
fixed version from JIRA, now the bottleneck seems to be ... in 
Summarizer, of all things. :-)

I'm loading the DistributedSearch$Server to 100% CPU, and then the split 
is as follows:

* 82% NutchBean.getSummary() -> Summarizer.getSummary() -> getTokens()
   -> 65% NutchDocumentTokenizer.next()
* 14% NutchBean.search()
* 2% IPC

which is slightly ridiculous... I think this makes a good case for 
storing pre-tokenized text in segments.

Regarding the allocation hot spots, we have the following top entries:

* 19.1% - 22,109 kB - 535,903 alloc. org.apache.lucene.index.TermBuffer.toTerm
* 38.8% - 44,998 kB - 937,937 alloc. org.apache.nutch.analysis.CommonGrams$Filter.next
   -> 29.6% - 34,380 kB - 717,713 alloc. org.apache.nutch.analysis.NutchDocumentTokenizer.next
* 13.8% - 15,989 kB - 12 alloc. org.apache.lucene.index.SegmentReader.norms

It seems that Nutch is uselessly re-tokenizing a lot of stuff - at this 
stage we shouldn't need any re-tokenization except for the user query... 
so I would argue that these parts should be redesigned to store and 
retrieve pre-tokenized values.
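
To make the idea concrete, here is a rough sketch of what "store and retrieve pre-tokenized values" could look like. Everything in it is an assumption for illustration: PreTokenizedStore, Token, store() and getTokens() are hypothetical names, not existing Nutch or Lucene classes; the in-memory map stands in for whatever the segment format would actually hold; and the tokenizer is a trivial stand-in for NutchDocumentTokenizer. The point is only that tokenization happens once, at parse time, and the summarizer reads the stored tokens (with their offsets) on every hit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: tokenize once at parse time, reuse at summary time. */
public class PreTokenizedStore {

    /** A token plus its character offsets in the original text. */
    public static final class Token {
        public final String term;
        public final int start;
        public final int end;

        public Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Stand-in for segment storage (assumption: keyed by some document id).
    private final Map<String, List<Token>> tokensByDocId = new HashMap<>();
    private int tokenizeCalls = 0;

    /** Trivial stand-in tokenizer: runs of letters/digits, lowercased. */
    static List<Token> tokenize(String text) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip separators.
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token.
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                out.add(new Token(term, start, i));
            }
        }
        return out;
    }

    /** Called once per document, at parse time. */
    public void store(String docId, String text) {
        tokenizeCalls++;
        tokensByDocId.put(docId, tokenize(text));
    }

    /** Called per search hit; returns stored tokens without re-tokenizing. */
    public List<Token> getTokens(String docId) {
        List<Token> tokens = tokensByDocId.get(docId);
        return tokens == null ? Collections.<Token>emptyList() : tokens;
    }

    /** How many times the tokenizer has run (for illustration only). */
    public int tokenizeCalls() {
        return tokenizeCalls;
    }
}
```

With something along these lines, repeated calls to getTokens() for the same document never touch the tokenizer again, so the per-hit cost reduces to a lookup plus the summary logic itself. Whether the stored form should be a token list, offsets only, or something more compact is a separate design question.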

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




