lucene-dev mailing list archives

From Peter Keegan <peterlkee...@gmail.com>
Subject Re: Common Bottlenecks
Date Wed, 24 Jun 2009 14:13:34 GMT
Our biggest bottleneck in searching is in a custom scorer which calls
AllTermDocs.next() very frequently. This class uses Lucene's own BitVector,
which I think is already highly optimized. Farther down in the list are
DocSetHitCollector.collect() and FieldSortedQueue.insert(). For indexing,
the main bottleneck is in the Analyzer/Filter, which is basically a
WhitespaceAnalyzer with custom code to add payloads to tokens and adjust the
position increments between tokens.
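For anyone curious, the shape of that analysis step is roughly the following. This is a minimal, self-contained sketch in plain Java, not the actual code: SimpleToken, PayloadWhitespaceAnalyzer, and the length-as-payload scheme are made up for illustration (the real thing subclasses Lucene's Analyzer/TokenFilter and stores the payload on the Token).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Lucene Token that carries a payload
// and a position increment (the gap to the previous token).
class SimpleToken {
    final String term;
    final byte[] payload;
    final int positionIncrement;

    SimpleToken(String term, byte[] payload, int positionIncrement) {
        this.term = term;
        this.payload = payload;
        this.positionIncrement = positionIncrement;
    }
}

// Sketch of the pattern described above: whitespace tokenization plus a
// filter pass that attaches a payload to each token and sets the position
// increment. The payload here (term length) is purely illustrative.
class PayloadWhitespaceAnalyzer {
    static List<SimpleToken> analyze(String text) {
        List<SimpleToken> tokens = new ArrayList<SimpleToken>();
        for (String term : text.trim().split("\\s+")) {
            byte[] payload = new byte[] { (byte) term.length() };
            tokens.add(new SimpleToken(term, payload, 1));
        }
        return tokens;
    }
}
```

The point for profiling is that this per-token work (payload construction, position bookkeeping) runs once for every token in the corpus, which is why it dominates indexing time.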


Peter


On Tue, Jun 9, 2009 at 7:17 PM, Vico Marziale <vicodark@gmail.com> wrote:

> Hello all. I am new to Lucene as well as this list. I am a PhD student at
> the University of New Orleans. My current research is in leveraging
> highly multicore processors to speed up computer forensics tools. For the
> moment I am trying to figure out what the most common performance bottleneck
> inside of Lucene itself is. I will then take a crack at porting some (small)
> portion of Lucene to CUDA (http://www.nvidia.com/object/cuda_what_is.html)
> and see what kind of speedups are achievable.
>
> The portion of code to be ported must be trivially parallelizable. After
> spending some time digging around the docs and source, StandardAnalyzer
> appears to be a likely candidate. I've run the demo code through a profiler,
> but it was less than helpful, especially since the bottlenecks
> will depend on how the Lucene API is used. In
> general, what is the most computationally expensive part of the process?
> Does the analyzer seem like a reasonable choice?
>
> Thanks,
> --
> Lodovico Marziale
> PhD Candidate
> Department of Computer Science
> University of New Orleans
>
