From: Varun Dhussa
Organization: CE InfoSystems (P) Ltd
To: java-user@lucene.apache.org
Date: Wed, 18 Feb 2009 19:29:49 +0530
Subject: Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers
Message-ID: <499C1455.5080007@mapmyindia.com>
In-Reply-To: <18248.30864.qm@web26005.mail.ukl.yahoo.com>

The method suggested would make the search faster, but I doubt the gain would be substantial on processors with slower clock speeds. Since most processors are going multi-core, it would make sense to multi-thread the scan.

Any remarks are welcome!

Varun Dhussa
Product Architect
CE InfoSystems (P) Ltd
http://www.mapmyindia.com

mark harwood wrote:
> I was having some thoughts recently about speeding up fuzzy search.
>
> The current system does edit-distance on all terms A-Z, single threaded. Prefix length can reduce the search space and there is a "minimum similarity" threshold but that's roughly where we are. Multithreading this to make use of multiple CPUs is one option to look at but I was mainly thinking about smarter ways to do the fuzzy scan:
>
> I had the notion that we could move to a solution where a priority queue keeps the "best matches so far" and as you progress through the termEnum you could bail out of edit distance calculations quickly using a rough (cheap) assessment of whether the current term is likely to make the cut (i.e. beat the current lowest score in the priority queue).
> It would make sense to populate the priority queue ASAP with terms that are most likely to be the best matches and these will be the ones that share a reasonable length prefix.
>
> As an example - searching for Obama~
>
> 1) Create "best matches" priority queue
> 2) Scan all terms from oba to obz populating priority queue
> 3) Scan all terms from "a" to "oba" and "obz" to "z", exiting quickly if the term fails to meet lowest score in the priority queue.
>
> How we "exit quickly" and how we determine what prefix to use in 2) are to be determined but the principle seems reasonable
>
> Thoughts?
>
> ----- Original Message ----
> From: Varun Dhussa
> To: java-user@lucene.apache.org
> Sent: Wednesday, 18 February, 2009 10:36:07
> Subject: Lucene search performance on Sun UltraSparc T2 (T5120) servers
>
> Hi,
>
> I have had a bad experience when migrating my application from Intel Xeon based servers to Sun UltraSparc T2 T5120 servers. Lucene fuzzy search just does not perform. A search which took approximately 500 ms takes more than 6 seconds to execute.
>
> The index has about 100,000,000 records. So, I tried to split it into 10 indices and used the ParallelSearcher on it, but still got similar results.
>
> I am guessing that this is because the distance implementation used by Lucene requires higher clock speed and can't be parallelized much.
>
> Please advise
>
> --
> Varun Dhussa
> Product Architect
> CE InfoSystems (P) Ltd
> http://www.mapmyindia.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
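
The three-step scan proposed in the quoted message (fill a "best matches" priority queue from the terms sharing a prefix, then scan the rest and exit quickly) could be sketched roughly as below. This is only an illustration under stated assumptions, not Lucene code: the class FuzzyTopK and its methods topK, offer and boundedEditDistance are hypothetical names, the term dictionary is a plain in-memory list rather than a termEnum, and the score is raw Levenshtein distance rather than Lucene's similarity. The thread leaves "exit quickly" open; the two cheap cut-offs used here (length difference as a lower bound, and abandoning the distance matrix once a whole row exceeds the limit) are one possible choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class FuzzyTopK {

    /** A candidate term and its edit distance to the query (hypothetical helper type). */
    static final class Match {
        final String term;
        final int distance;
        Match(String term, int distance) { this.term = term; this.distance = distance; }
    }

    /** Levenshtein distance; returns some value > maxDist as soon as maxDist cannot be met. */
    static int boundedEditDistance(String a, String b, int maxDist) {
        // Cheap check: the length difference is a lower bound on the edit distance.
        if (Math.abs(a.length() - b.length()) > maxDist) return maxDist + 1;
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Row minima never decrease, so the final distance cannot be below rowMin.
            if (rowMin > maxDist) return maxDist + 1;
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Adds the term to the queue only if it beats the current worst kept match. */
    static void offer(PriorityQueue<Match> best, String term, String query, int k, int maxDist) {
        // While the queue has room, any term within maxDist qualifies; once it is
        // full, a term must be strictly better than the worst match kept so far.
        int limit = best.size() < k ? maxDist : Math.min(maxDist, best.peek().distance - 1);
        if (limit < 0) return;
        int d = boundedEditDistance(query, term, limit);
        if (d > limit) return;
        if (best.size() == k) best.poll();
        best.add(new Match(term, d));
    }

    /** Returns up to k terms within maxDist of the query, closest first. */
    static List<String> topK(List<String> terms, String query, int prefixLen, int k, int maxDist) {
        String prefix = query.substring(0, Math.min(prefixLen, query.length()));
        // Max-heap on distance: the head is the worst match currently kept.
        PriorityQueue<Match> best = new PriorityQueue<>((x, y) -> Integer.compare(y.distance, x.distance));
        // Step 2: terms sharing the prefix first (a real term dictionary would seek to the range).
        for (String t : terms) if (t.startsWith(prefix)) offer(best, t, query, k, maxDist);
        // Step 3: all remaining terms, now with a tighter limit for the early exit.
        for (String t : terms) if (!t.startsWith(prefix)) offer(best, t, query, k, maxDist);
        List<Match> sorted = new ArrayList<>(best);
        sorted.sort((x, y) -> x.distance != y.distance
                ? Integer.compare(x.distance, y.distance) : x.term.compareTo(y.term));
        List<String> out = new ArrayList<>();
        for (Match m : sorted) out.add(m.term);
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "alabama", "bahama", "oba", "obama", "obamas", "obana", "osama", "zebra");
        System.out.println(topK(terms, "obama", 3, 3, 2)); // prints [obama, obamas, obana]
    }
}
```

In this toy run the prefix pass fills the queue with matches at distance 0 and 1, so every term in the second pass is tested against a limit of 0 and most are rejected on length alone. Whether that pruning pays off on a 100,000,000-record index, or on a low-clock-speed CPU, would need measuring.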