Return-Path: Delivered-To: apmail-jakarta-lucene-dev-archive@www.apache.org Received: (qmail 47272 invoked from network); 8 Jan 2004 18:13:56 -0000 Received: from daedalus.apache.org (HELO mail.apache.org) (208.185.179.12) by minotaur-2.apache.org with SMTP; 8 Jan 2004 18:13:56 -0000 Received: (qmail 15965 invoked by uid 500); 8 Jan 2004 18:13:44 -0000 Delivered-To: apmail-jakarta-lucene-dev-archive@jakarta.apache.org Received: (qmail 15945 invoked by uid 500); 8 Jan 2004 18:13:44 -0000 Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm Precedence: bulk List-Unsubscribe: List-Subscribe: List-Help: List-Post: List-Id: "Lucene Developers List" Reply-To: "Lucene Developers List" Delivered-To: mailing list lucene-dev@jakarta.apache.org Received: (qmail 15924 invoked from network); 8 Jan 2004 18:13:43 -0000 Received: from unknown (HELO gallantin.skynet.be) (195.238.2.124) by daedalus.apache.org with SMTP; 8 Jan 2004 18:13:43 -0000 Received: from DELLLAT1L34N0J (103-199.241.81.adsl.skynet.be [81.241.199.103]) by gallantin.skynet.be (8.12.9/8.12.9/Skynet-OUT-2.21) with SMTP id i08IDiPp010140 for ; Thu, 8 Jan 2004 19:13:44 +0100 (envelope-from ) Reply-To: From: "Jean-Francois Halleux" To: "Lucene Developers List" Subject: RE: Parallel search in MultiSearcher Date: Thu, 8 Jan 2004 19:11:52 +0100 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.6604 (9.0.2911.0) In-Reply-To: <3FFC8CC1.2010803@apache.org> X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1165 Importance: Normal X-RAVMilter-Version: 8.4.3(snapshot 20030212) (gallantin.skynet.be) X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N X-Spam-Rating: minotaur-2.apache.org 1.6.2 0/1000/N Hello, The following patch includes the two mentioned changes plus the unit test class and a light modification to testMultiSearch to allow subclassing it. I would be interested to know if somebody tried this on a multiple HDD machine or in a distributed index architecture if performance improvement is noticeable. KR, Jeff Halleux ---- Index: java/org/apache/lucene/search/ParallelMultiSearcher.java =================================================================== RCS file: java/org/apache/lucene/search/ParallelMultiSearcher.java diff -N java/org/apache/lucene/search/ParallelMultiSearcher.java --- /dev/null 1 Jan 1970 00:00:00 -0000 +++ java/org/apache/lucene/search/ParallelMultiSearcher.java 8 Jan 2004 18:00:00 -0000 @@ -0,0 +1,247 @@ +package org.apache.lucene.search; + +/* ==================================================================== + * The Apache Software License, Version 1.1 + * + * Copyright (c) 2001 The Apache Software Foundation. All rights + * reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * + * 3. The end-user documentation included with the redistribution, + * if any, must include the following acknowledgment: + * "This product includes software developed by the + * Apache Software Foundation (http://www.apache.org/)." + * Alternately, this acknowledgment may appear in the software itself, + * if and wherever such third-party acknowledgments normally appear. + * + * 4. The names "Apache" and "Apache Software Foundation" and + * "Apache Lucene" must not be used to endorse or promote products + * derived from this software without prior written permission. For + * written permission, please contact apache@apache.org. + * + * 5. Products derived from this software may not be called "Apache", + * "Apache Lucene", nor may "Apache" appear in their name, without + * prior written permission of the Apache Software Foundation. + * + * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED + * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR + * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF + * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT + * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * ==================================================================== + * + * This software consists of voluntary contributions made by many + * individuals on behalf of the Apache Software Foundation. For more + * information on the Apache Software Foundation, please see + * . + */ + +import java.io.IOException; + +import org.apache.lucene.index.Term; + +/** Implements parallel search over a set of Searchables. + * + *

Applications usually need only call the inherited {@link #search(Query)} + * or {@link #search(Query,Filter)} methods. + */ +public class ParallelMultiSearcher extends MultiSearcher { + private Searchable[] searchables; + private int[] starts; + private int maxDoc = 0; + + /** Creates a searcher which searches searchables. */ + public ParallelMultiSearcher(Searchable[] searchables) throws IOException { + super(searchables); + } + + /** + * TODO: parallelize this one too + */ + public int docFreq(Term term) throws IOException { + int docFreq = 0; + for (int i = 0; i < searchables.length; i++) + docFreq += searchables[i].docFreq(term); + return docFreq; + } + + /** + * A search implementation which spans a new thread for each + * Searchable, waits for each search to complete and merge + * the results back together. + */ + public TopDocs search(Query query, Filter filter, int nDocs) + throws IOException { + HitQueue hq = new HitQueue(nDocs); + int totalHits = 0; + MultiSearcherThread[] msta = + new MultiSearcherThread[searchables.length]; + + for (int i = 0; i < searchables.length; i++) { // search each searcher + // Assume not too many searchables and cost of creating a thread is by far inferior to a search + msta[i] = + new MultiSearcherThread( + searchables[i], + query, + filter, + nDocs, + hq, + i, + starts, + "MultiSearcher thread #" + (i + 1)); + msta[i].start(); + } + + for (int i = 0; i < searchables.length; i++) { + try { + msta[i].join(); + } catch (InterruptedException ie) { + ; // TODO: what should we do with this??? + } + IOException ioe = msta[i].getIOException(); + if (ioe == null) { + totalHits += msta[i].hits(); + } else { + // if one search produced an IOException, rethrow it + throw ioe; + } + } + + ScoreDoc[] scoreDocs = new ScoreDoc[hq.size()]; + for (int i = hq.size() - 1; i >= 0; i--) // put docs in array + scoreDocs[i] = (ScoreDoc) hq.pop(); + + return new TopDocs(totalHits, scoreDocs); + } + + /** Lower-level search API. + * + *

{@link HitCollector#collect(int,float)} is called for every non-zero + * scoring document. + * + *

Applications should only use this if they need all of the + * matching documents. The high-level search API ({@link + * Searcher#search(Query)}) is usually more efficient, as it skips + * non-high-scoring hits. + * + * @param query to match documents + * @param filter if non-null, a bitset used to eliminate some documents + * @param results to receive hits + * + * TODO: parallelize this one too + */ + public void search(Query query, Filter filter, final HitCollector results) + throws IOException { + for (int i = 0; i < searchables.length; i++) { + + final int start = starts[i]; + + searchables[i].search(query, filter, new HitCollector() { + public void collect(int doc, float score) { + results.collect(doc + start, score); + } + }); + + } + } + + /* + * TODO: this one could be parallelized too + * @see org.apache.lucene.search.Searchable#rewrite(org.apache.lucene.search.Query) + */ + public Query rewrite(Query original) throws IOException { + Query[] queries = new Query[searchables.length]; + for (int i = 0; i < searchables.length; i++) { + queries[i] = searchables[i].rewrite(original); + } + return original.combine(queries); + } + +} + +/** + * A thread subclass for searching a single searchable + */ +class MultiSearcherThread extends Thread { + + private Searchable searchable; + private Query query; + private Filter filter; + private int nDocs; + private int hits; + private TopDocs docs; + private int i; + private HitQueue hq; + private int[] starts; + private IOException ioe; + + public MultiSearcherThread( + Searchable searchable, + Query query, + Filter filter, + int nDocs, + HitQueue hq, + int i, + int[] starts, + String name) { + super(name); + this.searchable = searchable; + this.query = query; + this.filter = filter; + this.nDocs = nDocs; + this.hq = hq; + this.i = i; + this.starts = starts; + } + + public void run() { + try { + docs = searchable.search(query, filter, nDocs); + } + // Store the IOException for later use by the caller of this thread + catch (IOException ioe) { + this.ioe = ioe; + } + if (ioe == null) { + ScoreDoc[] scoreDocs = docs.scoreDocs; + for (int j = 0; + j < scoreDocs.length; + j++) { // merge scoreDocs into hq + ScoreDoc scoreDoc = scoreDocs[j]; + scoreDoc.doc += starts[i]; // convert doc + //it would be so nice if we had a thread-safe insert + synchronized (hq) { + if (!hq.insert(scoreDoc)) + break; + } // no more scores > minScore + } + } + } + + public int hits() { + return docs.totalHits; + } + + public IOException getIOException() { + return ioe; + } + +} Index: test/org/apache/lucene/search/TestMultiSearcher.java =================================================================== RCS file: /home/cvspublic/jakarta-lucene/src/test/org/apache/lucene/search/TestMultiSe archer.java,v retrieving revision 1.4 diff -u -r1.4 TestMultiSearcher.java --- test/org/apache/lucene/search/TestMultiSearcher.java 26 Nov 2002 17:31:43 -0000 1.4 +++ test/org/apache/lucene/search/TestMultiSearcher.java 8 Jan 2004 18:00:02 -0000 @@ -81,6 +81,14 @@ super(name); } + /** + * Return a new instance of the concrete MultiSearcher class + * used in this test + */ + protected MultiSearcher getMultiSearcherInstance(Searcher[] searchers) throws IOException { + return new MultiSearcher(searchers); + } + public void testEmptyIndex() throws Exception { @@ -134,7 +142,7 @@ searchers[0] = new IndexSearcher(indexStoreB); searchers[1] = new IndexSearcher(indexStoreA); // creating the multiSearcher - Searcher mSearcher = new MultiSearcher(searchers); + Searcher mSearcher = getMultiSearcherInstance(searchers); // performing the search Hits hits = mSearcher.search(query); @@ -171,7 +179,7 @@ searchers2[0] = new IndexSearcher(indexStoreB); searchers2[1] = new IndexSearcher(indexStoreA); // creating the mulitSearcher - Searcher mSearcher2 = new MultiSearcher(searchers2); + Searcher mSearcher2 = getMultiSearcherInstance(searchers2); // performing the same search Hits hits2 = mSearcher2.search(query); @@ -213,7 +221,7 @@ searchers3[0] = new IndexSearcher(indexStoreB); searchers3[1] = new IndexSearcher(indexStoreA); // creating the mulitSearcher - Searcher mSearcher3 = new MultiSearcher(searchers3); + Searcher mSearcher3 = getMultiSearcherInstance(searchers3); // performing the same search Hits hits3 = mSearcher3.search(query); Index: test/org/apache/lucene/search/TestParallelMultiSearcher.java =================================================================== RCS file: test/org/apache/lucene/search/TestParallelMultiSearcher.java diff -N test/org/apache/lucene/search/TestParallelMultiSearcher.java --- /dev/null 1 Jan 1970 00:00:00 -0000 +++ test/org/apache/lucene/search/TestParallelMultiSearcher.java 8 Jan 2004 18:00:02 -0000 @@ -0,0 +1,73 @@ +package org.apache.lucene.search; + +/* ==================================================================== + * The Apache Software License, Version 1.1 + * + * Copyright (c) 2001 The Apache Software Foundation. All rights + * reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * + * 3. The end-user documentation included with the redistribution, + * if any, must include the following acknowledgment: + * "This product includes software developed by the + * Apache Software Foundation (http://www.apache.org/)." + * Alternately, this acknowledgment may appear in the software itself, + * if and wherever such third-party acknowledgments normally appear. + * + * 4. The names "Apache" and "Apache Software Foundation" and + * "Apache Lucene" must not be used to endorse or promote products + * derived from this software without prior written permission. For + * written permission, please contact apache@apache.org. + * + * 5. Products derived from this software may not be called "Apache", + * "Apache Lucene", nor may "Apache" appear in their name, without + * prior written permission of the Apache Software Foundation. + * + * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED + * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR + * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF + * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT + * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + * ==================================================================== + * + * This software consists of voluntary contributions made by many + * individuals on behalf of the Apache Software Foundation. For more + * information on the Apache Software Foundation, please see + * . + */ + +import java.io.IOException; + +/** + * Unit tests for the ParallelMultiSearcher + */ +public class TestParallelMultiSearcher extends TestMultiSearcher { + + public TestParallelMultiSearcher(String name) { + super(name); + } + + protected MultiSearcher getMultiSearcherInstance(Searcher[] searchers) + throws IOException { + return new ParallelMultiSearcher(searchers); + } + +} -----Original Message----- From: Doug Cutting [mailto:cutting@apache.org] Sent: mercredi 7 janvier 2004 23:49 To: Lucene Developers List Subject: Re: Parallel search in MultiSearcher Jean-Francois Halleux wrote: > as suggested, here is a patch that should parallelize searches among the > various searchables in a MultiSearcher. I only have a single HDD machine but > I guess that spreading index files over multiple HDD can greatly improve > search time when using this MultiSearcher. > > This passes ANT test but be careful as it's been a while since I've > programmed and I don't know Lucene well. That was quick! I have two suggestions. First, I think it might be best if this is a subclass of MultiSearcher, so that the old sequential version is still available. Second, I think the following synchronization goes too far: > + synchronized (MultiSearcherThread.class) { > + if(!hq.insert(scoreDoc)) break; > + } // no more scores > minScore Wouldn't it be sufficient to synchronize on 'hq', e.g.: synchronized (hq) { if(!hq.insert(scoreDoc)) break; } Doug --------------------------------------------------------------------- To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org For additional commands, e-mail: lucene-dev-help@jakarta.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org For additional commands, e-mail: lucene-dev-help@jakarta.apache.org