Subject: Re: Retrieving large numbers of documents from several disks in parallel
From: Lance Norskog
To: java-user@lucene.apache.org
Date: Thu, 22 Dec 2011 00:16:59 -0800

Is each index optimized?

From my vague grasp of Lucene file formats, I think you want to sort
the documents by segment document id, which is the order of documents
on the disk. This lets you materialize documents in their order on the
disk.

Solr (and other apps) generally use a separate thread per task and
separate index reading classes (not sure which any more).

As to the cold-start, how many terms are there? You are loading them
into the field cache, right? Solr has a feature called "auto-warming"
which automatically runs common queries each time it reopens an index.

On Wed, Dec 21, 2011 at 11:11 PM, Paul Libbrecht wrote:
> Michael,
>
> from a physical point of view, it would seem like the order in which the
> documents are read is very significant for the reading speed (feel the
> random access jump as being the issue).
>
> You could:
> - move to ram-disk or ssd to make a difference?
> - use something different than a searcher which might be doing it better
>   (pure speculation: does a hit-collector make a difference?)
>
> hope it helps.
>
> paul
>
>
> On 22 Dec 2011, at 03:45, Robert Bart wrote:
>
>> Hi All,
>>
>>
>> I am running Lucene 3.4 in an application that indexes about 1 billion
>> factual assertions (Documents) from the web over four separate disks, so
>> that each disk has a separate index of about 250 million documents. The
>> Documents are relatively small, less than 1KB each. These indexes provide
>> data to our web demo (http://openie.cs.washington.edu), where a typical
>> search needs to retrieve and materialize as many as 3,000 Documents from
>> each index in order to display a page of results to the user.
>>
>>
>> In the worst case, a new, uncached query takes around 30 seconds to
>> complete, with all four disks IO bottlenecked during most of this time. My
>> implementation uses a separate Thread per disk to (1) call
>> IndexSearcher.search(Query query, Filter filter, int n) and (2) process the
>> Documents returned from IndexSearcher.doc(int). Since 30 seconds seems like
>> a long time to retrieve 3,000 small Documents, I am wondering if I am
>> overlooking something simple somewhere.
>>
>>
>> Is there a better method for retrieving documents in bulk?
>>
>>
>> Is there a better way of parallelizing indexes from separate disks than to
>> use a MultiReader (which doesn’t seem to parallelize the task of
>> materializing Documents)?
>>
>>
>> Any other suggestions? I have tried some of the basic ideas on the Lucene
>> wiki, such as leaving the IndexSearcher open for the life of the process (a
>> servlet). Any help would be greatly appreciated!
>>
>>
>> Rob
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

-- 
Lance Norskog
goksron@gmail.com
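
A rough sketch of the "sort by document id before materializing" idea from the reply, in case it helps to see it spelled out. It reads the hits in ascending doc-id order (which should roughly follow on-disk order for stored fields) and then hands them back in the original ranked order. Everything named here is an assumption for illustration: DocIdOrderedFetch and fetchInDocIdOrder are made-up names, and the IntFunction stands in for a call to IndexSearcher.doc(int), which in real code throws IOException and would need wrapping. Whether this actually cuts the seek time will depend on the index layout and the disks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper, not part of Lucene.
final class DocIdOrderedFetch {

    /**
     * Loads documents in ascending doc-id order, but returns them in the
     * same order as rankedDocIds (e.g. the ScoreDoc.doc values of the hits).
     *
     * @param rankedDocIds doc ids in ranked (score) order
     * @param fetcher      stand-in for IndexSearcher.doc(int)
     */
    static <D> List<D> fetchInDocIdOrder(int[] rankedDocIds, IntFunction<D> fetcher) {
        int n = rankedDocIds.length;

        // positions holds indexes into rankedDocIds, sorted by the doc id they point to.
        Integer[] positions = new Integer[n];
        for (int i = 0; i < n; i++) {
            positions[i] = i;
        }
        Arrays.sort(positions, (a, b) -> Integer.compare(rankedDocIds[a], rankedDocIds[b]));

        // Pre-size the result so each document can be placed at its ranked position.
        List<D> results = new ArrayList<>(Collections.<D>nCopies(n, null));
        for (int pos : positions) {
            results.set(pos, fetcher.apply(rankedDocIds[pos]));
        }
        return results;
    }
}
```

With Lucene 3.x the int[] would presumably be filled from TopDocs.scoreDocs, and the fetcher would be something like id -> searcher.doc(id) wrapped to handle the IOException.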
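
For the "separate thread per disk" setup discussed in the thread, here is a minimal sketch of running one search-and-materialize task per index and merging the results. It is only an illustration under stated assumptions: PerDiskSearch and searchAll are invented names, and each Supplier is a placeholder for "search one index and load its Documents" (in Lucene 3.4 that would wrap IndexSearcher.search and IndexSearcher.doc and deal with their IOExceptions). A long-running servlet would more likely keep one shared pool rather than creating one per call as done here for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical helper, not part of Lucene or Solr.
final class PerDiskSearch {

    /**
     * Runs every per-index task concurrently (one thread per task) and
     * concatenates the results in the order the tasks were given.
     */
    static <D> List<D> searchAll(List<Supplier<List<D>>> perIndexTasks) {
        // One thread per index, so each disk can be kept busy independently.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, perIndexTasks.size()));
        try {
            List<CompletableFuture<List<D>>> futures = new ArrayList<>();
            for (Supplier<List<D>> task : perIndexTasks) {
                futures.add(CompletableFuture.supplyAsync(task, pool));
            }

            // join() blocks until the task finishes and rethrows failures unchecked.
            List<D> merged = new ArrayList<>();
            for (CompletableFuture<List<D>> future : futures) {
                merged.addAll(future.join());
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

The merged list here is just a concatenation; combining hits from four indexes into one ranked page would still need a merge by score on top of this.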