Subject: Re: Creating Lucene index in Hadoop
From: Ning Li
To: core-user@hadoop.apache.org
Date: Mon, 16 Mar 2009 16:47:59 -0400

I should have pointed out that the Nutch index build and contrib/index
target different applications. The latter is for applications that simply
want to build a Lucene index from a set of documents - e.g. with no link
analysis.

As to writing Lucene indexes, both work the same way - they write the final
results to the local file system and then copy them to HDFS. In
contrib/index, the intermediate results are kept in memory and not written
to HDFS.

Hope this clarifies things.

Cheers,
Ning

On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff wrote:
>
> I understand why you would index in the reduce phase, because the anchor
> text gets shuffled to be next to the document. However, when you index
> in the map phase, don't you just have to reindex later?
>
> The main point to the OP is that HDFS is a bad FS for writing Lucene
> indexes because of how Lucene works. The simple approach is to write
> your index outside of HDFS in the reduce phase, and then merge the
> indexes from each reducer manually.
>
> Ian
>
> Ning Li writes:
>
>> Or you can check out the index contrib. The difference of the two is that:
>> - In Nutch's indexing map/reduce job, indexes are built in the
>> reduce phase. Afterwards, they are merged into a smaller number of
>> shards if necessary.
>> The last time I checked, the merge process does
>> not use map/reduce.
>> - In contrib/index, small indexes are built in the map phase. They
>> are merged into the desired number of shards in the reduce phase. In
>> addition, they can be merged into existing shards.
>>
>> Cheers,
>> Ning
>>
>>
>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 wrote:
>>> You can see the Nutch code.
>>>
>>> 2009/3/13 Mark Kerzner
>>>
>>>> Hi,
>>>>
>>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>>
>>>> Thank you,
>>>> Mark
>>>>
>>>
>
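[Archive note] The "write the index locally, then copy to HDFS" pattern Ning
describes can be sketched roughly as below. This is a simplified,
standard-library-only illustration - no actual Hadoop or Lucene calls; the
directory layout and the copyToDfs helper are made up for the example. A
real reducer would build the shard with a Lucene IndexWriter and push it
with Hadoop's FileSystem.copyFromLocalFile.

```java
import java.io.IOException;
import java.nio.file.*;

public class ShardCopySketch {
    // Simulate a reducer building one index shard on local disk.
    // In a real job, a Lucene IndexWriter would write segment files here.
    static Path buildLocalShard(Path localTmp, int shardId) throws IOException {
        Path shard = Files.createDirectories(localTmp.resolve("shard-" + shardId));
        Files.writeString(shard.resolve("_0.cfs"), "segment data for shard " + shardId);
        return shard;
    }

    // Copy the *finished* shard to the shared destination. Here "HDFS" is
    // just another local directory; a real job would use
    // FileSystem.copyFromLocalFile instead of Files.copy.
    static void copyToDfs(Path shard, Path dfsRoot) throws IOException {
        Path dest = Files.createDirectories(dfsRoot.resolve(shard.getFileName()));
        try (DirectoryStream<Path> files = Files.newDirectoryStream(shard)) {
            for (Path f : files) {
                Files.copy(f, dest.resolve(f.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path localTmp = Files.createTempDirectory("local-index");
        Path dfsRoot = Files.createTempDirectory("fake-hdfs");
        // Two "reducers", each producing one shard, copied out only when done.
        for (int i = 0; i < 2; i++) {
            copyToDfs(buildLocalShard(localTmp, i), dfsRoot);
        }
        System.out.println(Files.exists(dfsRoot.resolve("shard-0").resolve("_0.cfs")));
        System.out.println(Files.exists(dfsRoot.resolve("shard-1").resolve("_0.cfs")));
    }
}
```

The key point, matching the thread: HDFS never sees a half-written index;
only complete shards are copied in.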
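[Archive note] The contrib/index approach quoted above - many small indexes
built in the map phase, merged down to a fixed number of shards in the
reduce phase - hinges on assigning each small index to a shard. A minimal
sketch of one such assignment scheme (round-robin by position; purely
illustrative, not the actual contrib/index partitioner - a real reducer
would then merge each shard's group with IndexWriter.addIndexes):

```java
import java.util.*;

public class ShardAssignSketch {
    // Assign many small map-phase indexes to numShards groups; each group
    // would then be merged into a single shard by one reducer.
    static Map<Integer, List<String>> assignToShards(List<String> smallIndexes,
                                                     int numShards) {
        Map<Integer, List<String>> shards = new HashMap<>();
        for (int i = 0; i < smallIndexes.size(); i++) {
            shards.computeIfAbsent(i % numShards, k -> new ArrayList<>())
                  .add(smallIndexes.get(i));
        }
        return shards;
    }

    public static void main(String[] args) {
        List<String> small = List.of("idx0", "idx1", "idx2", "idx3", "idx4");
        Map<Integer, List<String>> shards = assignToShards(small, 2);
        System.out.println(shards.get(0)); // [idx0, idx2, idx4]
        System.out.println(shards.get(1)); // [idx1, idx3]
    }
}
```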