From: Jean-Francois Halleux
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Subject: Parallel search in MultiSearcher
Date: Wed, 7 Jan 2004 21:17:29 +0100
In-Reply-To: <3FFC3EED.2070300@apache.org>

Hello,

as suggested, here is a patch that should parallelize searches across the
various searchables in a MultiSearcher. I only have a single-disk machine,
but I expect that spreading the index files over multiple disks can greatly
improve search time when searching through such a MultiSearcher.

The patch passes the Ant test suite, but please review it carefully: it has
been a while since I last programmed, and I don't know Lucene well.
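For readers who haven't used MultiSearcher before, here is a minimal usage
sketch; the index paths, field name, and demo class are illustrative only,
and assume the Lucene 1.3 API. With the patch applied, the search call
underneath fans out to the sub-searchers in parallel instead of visiting
them one by one:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class ParallelSearchDemo {
  public static void main(String[] args) throws Exception {
    // One sub-index per physical disk (paths are made up for the example).
    Searchable[] searchables = new Searchable[] {
      new IndexSearcher("/disk1/index"),
      new IndexSearcher("/disk2/index")
    };
    MultiSearcher searcher = new MultiSearcher(searchables);

    Query query = QueryParser.parse("lucene", "contents", new StandardAnalyzer());
    Hits hits = searcher.search(query); // fans out to both sub-indexes
    System.out.println(hits.length() + " hits");

    searcher.close(); // also closes the sub-searchers
  }
}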
KR,

Jean-Francois Halleux

----
Index: MultiSearcher.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/search/MultiSearcher.java,v
retrieving revision 1.12
diff -u -r1.12 MultiSearcher.java
--- MultiSearcher.java	16 Sep 2003 20:06:32 -0000	1.12
+++ MultiSearcher.java	7 Jan 2004 20:14:57 -0000
@@ -141,29 +141,45 @@
     return maxDoc;
   }
 
-  public TopDocs search(Query query, Filter filter, int nDocs)
-    throws IOException {
-    HitQueue hq = new HitQueue(nDocs);
-    int totalHits = 0;
-
-    for (int i = 0; i < searchables.length; i++) { // search each searcher
-      TopDocs docs = searchables[i].search(query, filter, nDocs);
-      totalHits += docs.totalHits;                 // update totalHits
-      ScoreDoc[] scoreDocs = docs.scoreDocs;
-      for (int j = 0; j < scoreDocs.length; j++) { // merge scoreDocs into hq
-        ScoreDoc scoreDoc = scoreDocs[j];
-        scoreDoc.doc += starts[i];                 // convert doc
-        if (!hq.insert(scoreDoc))
-          break;                                   // no more scores > minScore
-      }
-    }
-
-    ScoreDoc[] scoreDocs = new ScoreDoc[hq.size()];
-    for (int i = hq.size()-1; i >= 0; i--)         // put docs in array
-      scoreDocs[i] = (ScoreDoc)hq.pop();
+  /**
+   * A search implementation which spawns a new thread for each
+   * Searchable, waits for each search to complete, and merges
+   * the results together.
+   */
+  public TopDocs search(Query query, Filter filter, int nDocs)
+    throws IOException {
+    HitQueue hq = new HitQueue(nDocs);
+    int totalHits = 0;
+    MultiSearcherThread[] msta = new MultiSearcherThread[searchables.length];
+
+    for (int i = 0; i < searchables.length; i++) { // search each searcher
+      msta[i] = new MultiSearcherThread(searchables[i], query, filter,
+          nDocs, hq, i, starts, "MultiSearcher thread #" + (i + 1));
+      msta[i].start();
+    }
+
+    for (int i = 0; i < searchables.length; i++) {
+      try {
+        msta[i].join();
+      } catch (InterruptedException ie) {
+        ; // TODO: what should we do with this???
+      }
+      IOException ioe = msta[i].getIOException();
+      if (ioe == null) {
+        totalHits += msta[i].hits();
+      } else {
+        // if one search produced an IOException, rethrow it
+        throw ioe;
+      }
+    }
+
+    ScoreDoc[] scoreDocs = new ScoreDoc[hq.size()];
+    for (int i = hq.size()-1; i >= 0; i--)         // put docs in array
+      scoreDocs[i] = (ScoreDoc)hq.pop();
 
-    return new TopDocs(totalHits, scoreDocs);
-  }
+    return new TopDocs(totalHits, scoreDocs);
+  }
 
   /** Lower-level search API.
@@ -208,4 +224,64 @@
     return searchables[i].explain(query,doc-starts[i]); // dispatch to searcher
   }
 }
+
+/**
+ * A thread subclass for searching a single searchable.
+ */
+class MultiSearcherThread extends Thread {
+
+  private Searchable searchable;
+  private Query query;
+  private Filter filter;
+  private int nDocs;
+  private TopDocs docs;
+  private int i;
+  private HitQueue hq;
+  private int[] starts;
+  private IOException ioe;
+
+  public MultiSearcherThread(Searchable searchable, Query query, Filter filter,
+                             int nDocs, HitQueue hq, int i, int[] starts,
+                             String name) {
+    super(name);
+    this.searchable = searchable;
+    this.query = query;
+    this.filter = filter;
+    this.nDocs = nDocs;
+    this.hq = hq;
+    this.i = i;
+    this.starts = starts;
+  }
+
+  public void run() {
+    try {
+      docs = searchable.search(query, filter, nDocs);
+    }
+    // Store the IOException for later use by the caller of this thread
+    catch (IOException e) {
+      this.ioe = e;
+    }
+    if (ioe == null) {
+      ScoreDoc[] scoreDocs = docs.scoreDocs;
+      for (int j = 0; j < scoreDocs.length; j++) { // merge scoreDocs into hq
+        ScoreDoc scoreDoc = scoreDocs[j];
+        scoreDoc.doc += starts[i];                 // convert doc
+        // it would be so nice if we had a thread-safe insert
+        synchronized (MultiSearcherThread.class) {
+          if (!hq.insert(scoreDoc))
+            break;                                 // no more scores > minScore
+        }
+      }
+    }
+  }
+
+  public int hits() {
+    return docs.totalHits;
+  }
+
+  public IOException getIOException() {
+    return ioe;
+  }
+}
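One remark on the synchronized block in run() above: every worker thread
shares the same HitQueue, so locking that queue instance would serialize the
inserts just as well, whereas a lock on MultiSearcherThread.class is JVM-wide
and would also serialize inserts across unrelated MultiSearcher instances. A
possible variant of run() (a sketch only, not part of the patch above):

  // Identical to run() above, except that it locks the shared HitQueue.
  public void run() {
    try {
      docs = searchable.search(query, filter, nDocs);
    } catch (IOException e) {
      this.ioe = e;
    }
    if (ioe == null) {
      ScoreDoc[] scoreDocs = docs.scoreDocs;
      for (int j = 0; j < scoreDocs.length; j++) {
        ScoreDoc scoreDoc = scoreDocs[j];
        scoreDoc.doc += starts[i];  // convert to a global document number
        synchronized (hq) {         // per-queue lock, not a JVM-wide class lock
          if (!hq.insert(scoreDoc))
            break;                  // remaining docs cannot beat minScore
        }
      }
    }
  }

Functionally the two are equivalent for a single search; the class-level lock
only becomes a bottleneck when several MultiSearcher instances search
concurrently in the same JVM.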
-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org]
Sent: Wednesday, 7 January 2004 18:16
To: Lucene Developers List
Subject: Re: normalization BAD DESIGN ?

Robert Engels wrote:
> The design & implementation of the document/field normalization is very
> poor.

Thank you.

> It requires a byte[] with (number of documents * number of fields)
> elements!

That's correct, and it has been mentioned many times on this list. I've
never had anyone complain about the amount of memory this uses.

> With a document store of 100 million documents, with multiple fields, the
> memory required is staggering.

Is this a hypothetical case, or do you really have 100M documents that you
hope to efficiently search from a machine with less than a few hundred
megabytes of RAM? 100M is a very large collection, larger than any web
search engine until just a few years ago. And today, a few hundred
megabytes is not really that much RAM, hardly staggering. Most laptops now
ship with enough RAM to do this.

Collections with more than around 10M documents can become rather slow to
search (> 1 second on average). So most folks who have 100M-document
collections will implement a parallel distributed solution. (This is,
e.g., what web search engines do.) Multiple indexes are created, each for
a subset of the entire collection, and all are searched in parallel.
Unfortunately such distributed systems (like, e.g., Nutch's) tend not to
be generic, and none is included in Lucene.

It's almost possible to build a parallel distributed search system by
combining MultiSearcher with RemoteSearchable, except that MultiSearcher
searches serially, not in parallel. But it would not be hard to modify
MultiSearcher to implement parallel searching. (Volunteers?)

> IndexReader has the following method definition,
>
>   public abstract byte[] norms(String field) throws IOException;
>
> which is the source of the problem.
>
> Even returning null from this method does not help, as PhraseScorer and
> derived classes keep a reference and do not perform a null check.
>
> I have modified line 105 of PhraseScorer to be
>
>   if (norms != null)
>     score *= Similarity.decodeNorm(norms[first.doc]); // normalize

That's a reasonable approach. If you don't want length normalization, then
it should be possible to disable it.

> Would it not be a better design to define a method in IndexReader
>
>   float getNorm(String fieldname, int docnum);
>
> so an implementation could cache this information in some fashion, or
> always return 1.0 if it didn't care?

The problem with this approach is that the innermost search loops would
have to perform a hash table lookup on the field name, which would
significantly impact performance. So the first approach, permitting
IndexReader.norms() to return null, is preferable. If that winds up being
useful to you, please submit a patch.

Cheers,

Doug
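To make Doug's performance point concrete: with the byte[] API, a scorer
resolves the field's norms once, before its inner loop, so the per-document
cost is a single array access, while a per-call getNorm(field, doc) would
put a hash lookup on the field name inside that loop. A sketch of the loop
shape under the byte[] approach; this is illustrative code, not actual
PhraseScorer source, and the class and method names are made up:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

public class NormSketch {

  // Applies length normalization to precomputed raw scores for docs 0..n-1.
  static float[] normalize(IndexReader reader, String field, float[] rawScores)
      throws IOException {
    byte[] norms = reader.norms(field); // one field-name lookup per query;
                                        // under the proposed change, may be null
    float[] scores = new float[rawScores.length];
    for (int doc = 0; doc < rawScores.length; doc++) {
      float score = rawScores[doc];
      if (norms != null)                // Robert's null check: skip normalization
        score *= Similarity.decodeNorm(norms[doc]);
      scores[doc] = score;
    }
    return scores;
  }
}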