From: Ted Dunning
Date: Thu, 9 Jul 2009 09:57:42 -0700
Subject: Re: Lucene index creation using Hadoop
To: common-user@hadoop.apache.org

Exactly as we do.

Also, I find that with a collection large enough for speed to matter, we have
many more shards than we have reducers, so parallelism in indexing is nearly
perfect.

On Thu, Jul 9, 2009 at 9:13 AM, Ken Krugler wrote:

> We wind up with one index (shard) per reducer, so by controlling the number
> of reducers we can vary the shard count, down to a minimum count == the
> number of slaves in the processing cluster.

--
Ted Dunning, CTO
DeepDyve
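
To illustrate the "one index (shard) per reducer" point from the thread, here is a minimal sketch in plain Python rather than Hadoop code. It assumes documents are routed to reducers by a hash of the document id taken modulo the reducer count, which is similar in spirit to Hadoop's default hash partitioning; the thread does not say which partitioner was actually used. The names `shard_for` and `build_shards` are hypothetical and exist only for this example.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id: str, num_reducers: int) -> int:
    """Pick a reducer (and therefore a shard) for a document id.

    Assumption: a stable hash, masked to be non-negative, modulo the
    reducer count. This imitates hash partitioning; it is not Hadoop code.
    """
    return (zlib.crc32(doc_id.encode("utf-8")) & 0x7FFFFFFF) % num_reducers


def build_shards(doc_ids, num_reducers: int) -> dict:
    """Group document ids by reducer; each group stands in for one index shard."""
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_reducers)].append(doc_id)
    return dict(shards)


# Hypothetical input: 1000 synthetic document ids, 8 reducers.
docs = [f"doc-{i}" for i in range(1000)]
shards = build_shards(docs, num_reducers=8)

# The shard count can never exceed the reducer count.
print("shards produced:", len(shards))
```

Changing `num_reducers` is the only knob here, which mirrors the thread's point that the reducer count controls the shard count.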