Subject: Re: Creating Lucene index in Hadoop
From: Ning Li
To: core-user@hadoop.apache.org
Date: Mon, 16 Mar 2009 16:47:59 -0400

I should have pointed out that the Nutch index build and contrib/index
target different applications. The latter is for applications that simply
want to build a Lucene index from a set of documents - e.g. with no link
analysis.

As to writing Lucene indexes, both work the same way - they write the final
results to the local file system and then copy them to HDFS. In
contrib/index, the intermediate results are kept in memory and not written
to HDFS.

Hope this clarifies things.

Cheers,
Ning

On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff wrote:
>
> I understand why you would index in the reduce phase, because the anchor
> text gets shuffled to be next to the document. However, when you index
> in the map phase, don't you just have to reindex later?
>
> The main point to the OP is that HDFS is a bad FS for writing Lucene
> indexes because of how Lucene works. The simple approach is to write
> your index outside of HDFS in the reduce phase, and then merge the
> indexes from each reducer manually.
>
> Ian
>
> Ning Li writes:
>
>> Or you can check out the index contrib. The difference of the two is that:
>> - In Nutch's indexing map/reduce job, indexes are built in the
>> reduce phase. Afterwards, they are merged into a smaller number of
>> shards if necessary.
>> The last time I checked, the merge process does
>> not use map/reduce.
>> - In contrib/index, small indexes are built in the map phase. They
>> are merged into the desired number of shards in the reduce phase. In
>> addition, they can be merged into existing shards.
>>
>> Cheers,
>> Ning
>>
>>
>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 wrote:
>>> You can see the Nutch code.
>>>
>>> 2009/3/13 Mark Kerzner
>>>
>>>> Hi,
>>>>
>>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>>
>>>> Thank you,
>>>> Mark
>>>>
>>>
>
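[Archive note] The "write the index locally, then copy to HDFS" pattern Ning
describes can be sketched roughly as below. This is a simplified,
standard-library-only illustration - no actual Hadoop or Lucene calls; the
directory layout and the copyToDfs helper are made up for the example. A
real reducer would build the shard with a Lucene IndexWriter and push it
with Hadoop's FileSystem.copyFromLocalFile.

```java
import java.io.IOException;
import java.nio.file.*;

public class ShardCopySketch {
    // Simulate a reducer building one index shard on local disk.
    // In a real job, a Lucene IndexWriter would write segment files here.
    static Path buildLocalShard(Path localTmp, int shardId) throws IOException {
        Path shard = Files.createDirectories(localTmp.resolve("shard-" + shardId));
        Files.writeString(shard.resolve("_0.cfs"), "segment data for shard " + shardId);
        return shard;
    }

    // Copy the *finished* shard to the shared destination. Here "HDFS" is
    // just another local directory; a real job would use
    // FileSystem.copyFromLocalFile instead of Files.copy.
    static void copyToDfs(Path shard, Path dfsRoot) throws IOException {
        Path dest = Files.createDirectories(dfsRoot.resolve(shard.getFileName()));
        try (DirectoryStream<Path> files = Files.newDirectoryStream(shard)) {
            for (Path f : files) {
                Files.copy(f, dest.resolve(f.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path localTmp = Files.createTempDirectory("local-index");
        Path dfsRoot = Files.createTempDirectory("fake-hdfs");
        // Two "reducers", each producing one shard, copied out only when done.
        for (int i = 0; i < 2; i++) {
            copyToDfs(buildLocalShard(localTmp, i), dfsRoot);
        }
        System.out.println(Files.exists(dfsRoot.resolve("shard-0").resolve("_0.cfs")));
        System.out.println(Files.exists(dfsRoot.resolve("shard-1").resolve("_0.cfs")));
    }
}
```

The key point, matching the thread: HDFS never sees a half-written index;
only complete shards are copied in.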
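[Archive note] The contrib/index approach quoted above - many small indexes
built in the map phase, merged down to a fixed number of shards in the
reduce phase - hinges on assigning each small index to a shard. A minimal
sketch of one such assignment scheme (round-robin by position; purely
illustrative, not the actual contrib/index partitioner - a real reducer
would then merge each shard's group with IndexWriter.addIndexes):

```java
import java.util.*;

public class ShardAssignSketch {
    // Assign many small map-phase indexes to numShards groups; each group
    // would then be merged into a single shard by one reducer.
    static Map<Integer, List<String>> assignToShards(List<String> smallIndexes,
                                                     int numShards) {
        Map<Integer, List<String>> shards = new HashMap<>();
        for (int i = 0; i < smallIndexes.size(); i++) {
            shards.computeIfAbsent(i % numShards, k -> new ArrayList<>())
                  .add(smallIndexes.get(i));
        }
        return shards;
    }

    public static void main(String[] args) {
        List<String> small = List.of("idx0", "idx1", "idx2", "idx3", "idx4");
        Map<Integer, List<String>> shards = assignToShards(small, 2);
        System.out.println(shards.get(0)); // [idx0, idx2, idx4]
        System.out.println(shards.get(1)); // [idx1, idx3]
    }
}
```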