Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: <dea76a060906091617q26443be3i9df0e1313579fe62@mail.gmail.com>
References: <dea76a060906091617q26443be3i9df0e1313579fe62@mail.gmail.com>
Date: Wed, 24 Jun 2009 10:13:34 -0400
Message-ID: <e994873a0906240713g26d443a0j4668dfa8b64f735b@mail.gmail.com>
Subject: Re: Common Bottlenecks
From: Peter Keegan <peterlkeegan@gmail.com>
To: java-dev@lucene.apache.org
Content-Type: multipart/alternative; boundary=0015174c3388752af1046d18b630

--0015174c3388752af1046d18b630
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Our biggest bottleneck in searching is a custom scorer that calls
AllTermDocs.next() very frequently. This class uses Lucene's own BitVector,
which I think is already highly optimized. Further down the list are
DocSetHitCollector.collect() and FieldSortedQueue.insert(). For indexing,
the main bottleneck is the Analyzer/Filter, which is basically a
WhitespaceAnalyzer with custom code to add payloads to tokens and change the
positions between tokens.
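
To make the first bottleneck concrete, here is a minimal Java sketch of the
access pattern: an iterator that walks every document number and skips the
ones flagged as deleted in a bit set. This is not Lucene's code. The class
name LiveDocsIterator is hypothetical, and java.util.BitSet stands in for
Lucene's BitVector as an assumption. It only shows why a scorer that calls
next() once per document pays for a bit test on every step.

```java
import java.util.BitSet;

/** Hypothetical stand-in: iterates over all non-deleted document numbers. */
public class LiveDocsIterator {
    private final BitSet deleted; // set bit = deleted doc (assumed stand-in for BitVector)
    private final int maxDoc;     // one past the largest document number
    private int doc = -1;         // current position; -1 means "before the first doc"

    public LiveDocsIterator(BitSet deleted, int maxDoc) {
        this.deleted = deleted;
        this.maxDoc = maxDoc;
    }

    /** Advances to the next live document; returns false once exhausted. */
    public boolean next() {
        if (doc >= maxDoc) {
            return false; // already past the end, stay there
        }
        doc++;
        // One bit test per candidate document: cheap, but it adds up
        // when the caller invokes next() for every document in the index.
        while (doc < maxDoc && deleted.get(doc)) {
            doc++;
        }
        return doc < maxDoc;
    }

    /** Current document number; only meaningful after next() returned true. */
    public int doc() {
        return doc;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet(8);
        deleted.set(1);
        deleted.set(2);
        deleted.set(5);
        LiveDocsIterator it = new LiveDocsIterator(deleted, 8);
        StringBuilder sb = new StringBuilder();
        while (it.next()) {
            sb.append(it.doc()).append(' ');
        }
        System.out.println(sb.toString().trim()); // prints "0 3 4 6 7"
    }
}
```

Each call does very little work, so any speedup would have to come from
calling it less often or from processing documents in batches.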


Peter


On Tue, Jun 9, 2009 at 7:17 PM, Vico Marziale <vicodark@gmail.com> wrote:

> Hello all. I am new to Lucene as well as this list. I am a PhD student at
> the University of New Orleans. My current research is in leveraging
> highly-multicore processors to speed computer forensics tools. For the
> moment I am trying to figure out what the most common performance bottleneck
> inside of Lucene itself is. I will then take a crack at porting some (small)
> portion of Lucene to CUDA (http://www.nvidia.com/object/cuda_what_is.html)
> and see what kind of speedups are achievable.
>
> The portion of code to be ported must be trivially parallelizable. After
> spending some time digging around the docs and source, StandardAnalyzer
> appears to be a likely candidate. I've run the demo code through a profiler,
> but it was less than helpful, especially since bottlenecks depend on how the
> Lucene API is used. In general, what is the most computationally expensive
> part of the process? Does the analyzer seem like a reasonable choice?
>
> Thanks,
> --
> Lodovico Marziale
> PhD Candidate
> Department of Computer Science
> University of New Orleans
>

--0015174c3388752af1046d18b630--