From: Peter Keegan
To: java-dev@lucene.apache.org
Date: Wed, 24 Jun 2009 10:13:34 -0400
Subject: Re: Common Bottlenecks

Our biggest bottleneck in searching is in a custom scorer which calls
AllTermDocs.next() very frequently. This class uses Lucene's own BitVector,
which I think is already highly optimized. Further down the list are
DocSetHitCollector.collect() and FieldSortedQueue.insert().

For indexing, the main bottleneck is in the Analyzer/Filter, which is
basically a WhitespaceAnalyzer with custom code to add payloads to tokens
and change the positions between tokens.

Peter

On Tue, Jun 9, 2009 at 7:17 PM, Vico Marziale wrote:

> Hello all. I am new to Lucene as well as this list. I am a PhD student at
> the University of New Orleans. My current research is in leveraging
> highly-multicore processors to speed up computer forensics tools. For the
> moment I am trying to figure out what the most common performance
> bottleneck inside of Lucene itself is. I will then take a crack at porting
> some (small) portion of Lucene to CUDA
> (http://www.nvidia.com/object/cuda_what_is.html) and see what kind of
> speedups are achievable.
>
> The portion of code to be ported must be trivially parallelizable. After
> spending some time digging around the docs and source, StandardAnalyzer
> appears to be a likely candidate. I've run the demo code through a
> profiler, but it was less than helpful, especially in light of the fact
> that bottlenecks are going to be dependent on the way the Lucene API is
> used. In general, what is the most computationally expensive part of the
> process? Does the analyzer seem like a reasonable choice?
>
> Thanks,
> --
> Lodovico Marziale
> PhD Candidate
> Department of Computer Science
> University of New Orleans
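
For readers unfamiliar with the indexing step described above (whitespace tokenization plus per-token payloads and adjusted positions), here is a minimal sketch of the idea in plain Java. It deliberately does not use Lucene's TokenStream API and is not the code from the message. The Token class, the one-byte "starts with an uppercase letter" payload, and the sentenceGap rule are all hypothetical choices made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class PayloadTokenSketch {

    /** Hypothetical stand-in for an index token: term, position increment, payload bytes. */
    public static final class Token {
        public final String term;
        public final int positionIncrement;
        public final byte[] payload;

        public Token(String term, int positionIncrement, byte[] payload) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.payload = payload;
        }
    }

    /**
     * Splits on whitespace, attaches a one-byte payload to each token, and
     * widens the position gap after a token that ends with a period.
     * Both rules are illustrative assumptions, not rules from the message.
     */
    public static List<Token> tokenize(String text, int sentenceGap) {
        List<Token> tokens = new ArrayList<>();
        int increment = 1; // position increment for the next emitted token
        for (String raw : text.split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // leading whitespace produces an empty first element
            }
            boolean endsSentence = raw.endsWith(".");
            String term = endsSentence ? raw.substring(0, raw.length() - 1) : raw;
            if (term.isEmpty()) {
                // A lone "." carries no term but still ends a sentence.
                increment = sentenceGap;
                continue;
            }
            // Payload: 1 if the term starts with an uppercase letter, else 0.
            byte flag = (byte) (Character.isUpperCase(term.charAt(0)) ? 1 : 0);
            tokens.add(new Token(term, increment, new byte[] { flag }));
            increment = endsSentence ? sentenceGap : 1;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("Lucene is fast. Profile first.", 10)) {
            System.out.println(t.term + " +" + t.positionIncrement + " payload=" + t.payload[0]);
        }
    }
}
```

Each token here depends only on its own text and on whether the previous token ended a sentence, so separate documents can be tokenized independently. That independence is the property the original question is looking for, though whether it pays off on a GPU would have to be measured.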