From: Jean-Francois Halleux
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Subject: Parallel search in MultiSearcher
Date: Wed, 7 Jan 2004 21:17:29 +0100
In-Reply-To: <3FFC3EED.2070300@apache.org>

Hello,

as suggested, here is a patch that should parallelize searches across the
various searchables in a MultiSearcher. I only have a single-disk machine,
but I expect that spreading the index files over multiple disks can greatly
improve search time when searching through such a MultiSearcher.

The patch passes the Ant test suite, but please review it carefully: it has
been a while since I last programmed, and I don't know Lucene well.
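For readers who haven't used MultiSearcher before, here is a minimal usage
sketch; the index paths, field name, and demo class are illustrative only,
and assume the Lucene 1.3 API. With the patch applied, the search call
underneath fans out to the sub-searchers in parallel instead of visiting
them one by one:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class ParallelSearchDemo {
  public static void main(String[] args) throws Exception {
    // One sub-index per physical disk (paths are made up for the example).
    Searchable[] searchables = new Searchable[] {
      new IndexSearcher("/disk1/index"),
      new IndexSearcher("/disk2/index")
    };
    MultiSearcher searcher = new MultiSearcher(searchables);

    Query query = QueryParser.parse("lucene", "contents", new StandardAnalyzer());
    Hits hits = searcher.search(query); // fans out to both sub-indexes
    System.out.println(hits.length() + " hits");

    searcher.close(); // also closes the sub-searchers
  }
}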
KR,

Jean-Francois Halleux

----
Index: MultiSearcher.java
===================================================================
RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/search/MultiSearcher.java,v
retrieving revision 1.12
diff -u -r1.12 MultiSearcher.java
--- MultiSearcher.java	16 Sep 2003 20:06:32 -0000	1.12
+++ MultiSearcher.java	7 Jan 2004 20:14:57 -0000
@@ -141,29 +141,45 @@
     return maxDoc;
   }
 
-  public TopDocs search(Query query, Filter filter, int nDocs)
-    throws IOException {
-    HitQueue hq = new HitQueue(nDocs);
-    int totalHits = 0;
-
-    for (int i = 0; i < searchables.length; i++) { // search each searcher
-      TopDocs docs = searchables[i].search(query, filter, nDocs);
-      totalHits += docs.totalHits;                 // update totalHits
-      ScoreDoc[] scoreDocs = docs.scoreDocs;
-      for (int j = 0; j < scoreDocs.length; j++) { // merge scoreDocs into hq
-        ScoreDoc scoreDoc = scoreDocs[j];
-        scoreDoc.doc += starts[i];                 // convert doc
-        if (!hq.insert(scoreDoc))
-          break;                                   // no more scores > minScore
-      }
-    }
-
-    ScoreDoc[] scoreDocs = new ScoreDoc[hq.size()];
-    for (int i = hq.size()-1; i >= 0; i--)         // put docs in array
-      scoreDocs[i] = (ScoreDoc)hq.pop();
+  /**
+   * A search implementation which spawns a new thread for each
+   * Searchable, waits for each search to complete, and merges
+   * the results together.
+   */
+  public TopDocs search(Query query, Filter filter, int nDocs)
+    throws IOException {
+    HitQueue hq = new HitQueue(nDocs);
+    int totalHits = 0;
+    MultiSearcherThread[] msta = new MultiSearcherThread[searchables.length];
+
+    for (int i = 0; i < searchables.length; i++) { // search each searcher
+      msta[i] = new MultiSearcherThread(searchables[i], query, filter,
+          nDocs, hq, i, starts, "MultiSearcher thread #" + (i + 1));
+      msta[i].start();
+    }
+
+    for (int i = 0; i < searchables.length; i++) {
+      try {
+        msta[i].join();
+      } catch (InterruptedException ie) {
+        ; // TODO: what should we do with this???
+      }
+      IOException ioe = msta[i].getIOException();
+      if (ioe == null) {
+        totalHits += msta[i].hits();
+      } else {
+        // if one search produced an IOException, rethrow it
+        throw ioe;
+      }
+    }
+
+    ScoreDoc[] scoreDocs = new ScoreDoc[hq.size()];
+    for (int i = hq.size()-1; i >= 0; i--)         // put docs in array
+      scoreDocs[i] = (ScoreDoc)hq.pop();
 
-    return new TopDocs(totalHits, scoreDocs);
-  }
+    return new TopDocs(totalHits, scoreDocs);
+  }
 
   /** Lower-level search API.
@@ -208,4 +224,64 @@
     return searchables[i].explain(query,doc-starts[i]); // dispatch to searcher
   }
 }
+
+/**
+ * A thread subclass for searching a single searchable.
+ */
+class MultiSearcherThread extends Thread {
+
+  private Searchable searchable;
+  private Query query;
+  private Filter filter;
+  private int nDocs;
+  private TopDocs docs;
+  private int i;
+  private HitQueue hq;
+  private int[] starts;
+  private IOException ioe;
+
+  public MultiSearcherThread(Searchable searchable, Query query, Filter filter,
+                             int nDocs, HitQueue hq, int i, int[] starts,
+                             String name) {
+    super(name);
+    this.searchable = searchable;
+    this.query = query;
+    this.filter = filter;
+    this.nDocs = nDocs;
+    this.hq = hq;
+    this.i = i;
+    this.starts = starts;
+  }
+
+  public void run() {
+    try {
+      docs = searchable.search(query, filter, nDocs);
+    }
+    // Store the IOException for later use by the caller of this thread
+    catch (IOException e) {
+      this.ioe = e;
+    }
+    if (ioe == null) {
+      ScoreDoc[] scoreDocs = docs.scoreDocs;
+      for (int j = 0; j < scoreDocs.length; j++) { // merge scoreDocs into hq
+        ScoreDoc scoreDoc = scoreDocs[j];
+        scoreDoc.doc += starts[i];                 // convert doc
+        // it would be so nice if we had a thread-safe insert
+        synchronized (MultiSearcherThread.class) {
+          if (!hq.insert(scoreDoc))
+            break;                                 // no more scores > minScore
+        }
+      }
+    }
+  }
+
+  public int hits() {
+    return docs.totalHits;
+  }
+
+  public IOException getIOException() {
+    return ioe;
+  }
+}
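One remark on the synchronized block in run() above: every worker thread
shares the same HitQueue, so locking that queue instance would serialize the
inserts just as well, whereas a lock on MultiSearcherThread.class is JVM-wide
and would also serialize inserts across unrelated MultiSearcher instances. A
possible variant of run() (a sketch only, not part of the patch above):

  // Identical to run() above, except that it locks the shared HitQueue.
  public void run() {
    try {
      docs = searchable.search(query, filter, nDocs);
    } catch (IOException e) {
      this.ioe = e;
    }
    if (ioe == null) {
      ScoreDoc[] scoreDocs = docs.scoreDocs;
      for (int j = 0; j < scoreDocs.length; j++) {
        ScoreDoc scoreDoc = scoreDocs[j];
        scoreDoc.doc += starts[i];  // convert to a global document number
        synchronized (hq) {         // per-queue lock, not a JVM-wide class lock
          if (!hq.insert(scoreDoc))
            break;                  // remaining docs cannot beat minScore
        }
      }
    }
  }

Functionally the two are equivalent for a single search; the class-level lock
only becomes a bottleneck when several MultiSearcher instances search
concurrently in the same JVM.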
-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org]
Sent: Wednesday, 7 January 2004 18:16
To: Lucene Developers List
Subject: Re: normalization BAD DESIGN ?

Robert Engels wrote:
> The design & implementation of the document/field normalization is very
> poor.

Thank you.

> It requires a byte[] with (number of documents * number of fields)
> elements!

That's correct, and it has been mentioned many times on this list. I've
never had anyone complain about the amount of memory this uses.

> With a document store of 100 million documents, with multiple fields, the
> memory required is staggering.

Is this a hypothetical case, or do you really have 100M documents that you
hope to efficiently search from a machine with less than a few hundred
megabytes of RAM? 100M is a very large collection, larger than any web
search engine until just a few years ago. And today, a few hundred
megabytes is not really that much RAM, hardly staggering. Most laptops now
ship with enough RAM to do this.

Collections with more than around 10M documents can become rather slow to
search (> 1 second on average). So most folks who have 100M-document
collections will implement a parallel distributed solution. (This is,
e.g., what web search engines do.) Multiple indexes are created, each for
a subset of the entire collection, and all are searched in parallel.
Unfortunately such distributed systems (like, e.g., Nutch's) tend not to
be generic, and none is included in Lucene.

It's almost possible to build a parallel distributed search system by
combining MultiSearcher with RemoteSearchable, except that MultiSearcher
searches serially, not in parallel. But it would not be hard to modify
MultiSearcher to implement parallel searching. (Volunteers?)

> IndexReader has the following method definition,
>
>   public abstract byte[] norms(String field) throws IOException;
>
> which is the source of the problem.
>
> Even returning null from this method does not help, as PhraseScorer and
> derived classes keep a reference and do not perform a null check.
>
> I have modified line 105 of PhraseScorer to be
>
>   if (norms != null)
>     score *= Similarity.decodeNorm(norms[first.doc]); // normalize

That's a reasonable approach. If you don't want length normalization, then
it should be possible to disable it.

> Would it not be a better design to define a method in IndexReader
>
>   float getNorm(String fieldname, int docnum);
>
> so an implementation could cache this information in some fashion, or
> always return 1.0 if it didn't care?

The problem with this approach is that the innermost search loops would
have to perform a hash table lookup on the field name, which would
significantly impact performance. So the first approach, permitting
IndexReader.norms() to return null, is preferable. If that winds up being
useful to you, please submit a patch.

Cheers,

Doug
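To make Doug's performance point concrete: with the byte[] API, a scorer
resolves the field's norms once, before its inner loop, so the per-document
cost is a single array access, while a per-call getNorm(field, doc) would
put a hash lookup on the field name inside that loop. A sketch of the loop
shape under the byte[] approach; this is illustrative code, not actual
PhraseScorer source, and the class and method names are made up:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

public class NormSketch {

  // Applies length normalization to precomputed raw scores for docs 0..n-1.
  static float[] normalize(IndexReader reader, String field, float[] rawScores)
      throws IOException {
    byte[] norms = reader.norms(field); // one field-name lookup per query;
                                        // under the proposed change, may be null
    float[] scores = new float[rawScores.length];
    for (int doc = 0; doc < rawScores.length; doc++) {
      float score = rawScores[doc];
      if (norms != null)                // Robert's null check: skip normalization
        score *= Similarity.decodeNorm(norms[doc]);
      scores[doc] = score;
    }
    return scores;
  }
}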