hadoop-common-user mailing list archives

From Ning Li <ning.li...@gmail.com>
Subject Re: Creating Lucene index in Hadoop
Date Mon, 16 Mar 2009 20:47:59 GMT
I should have pointed out that the Nutch index build and contrib/index
target different applications. The latter is for applications that
simply want to build a Lucene index from a set of documents, e.g. with
no link analysis.

As for writing Lucene indexes, both work the same way: they write the
final results to the local file system and then copy them to HDFS. In
contrib/index, the intermediate results are kept in memory and are not
written to HDFS.
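
As a rough sketch of that write-locally-then-copy pattern (this is not
the actual Nutch or contrib/index code), the example below uses only the
Java standard library. The class name LocalThenCopy, the methods
buildLocally and publish, and the "segment_N" file names are all
hypothetical, and a plain directory stands in for HDFS. In a real job the
local build would go through Lucene, and the copy step would use
something like Hadoop's FileSystem.copyFromLocalFile.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LocalThenCopy {

    // Stand-in for the index build (assumption: a real job would use Lucene
    // here). Writes one placeholder file per document into a local directory.
    public static void buildLocally(Path localDir, List<String> docs) throws IOException {
        Files.createDirectories(localDir);
        for (int i = 0; i < docs.size(); i++) {
            Files.write(localDir.resolve("segment_" + i),
                        docs.get(i).getBytes(StandardCharsets.UTF_8));
        }
    }

    // Copies the finished local files to the shared location in one pass.
    // The shared directory is a stand-in for HDFS. Returns the number of
    // files copied.
    public static int publish(Path localDir, Path sharedDir) throws IOException {
        Files.createDirectories(sharedDir);
        List<Path> files;
        try (Stream<Path> listing = Files.list(localDir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path file : files) {
            Files.copy(file, sharedDir.resolve(file.getFileName()),
                       StandardCopyOption.REPLACE_EXISTING);
        }
        return files.size();
    }

    public static void main(String[] args) throws IOException {
        Path local = Files.createTempDirectory("index-local");
        Path shared = Files.createTempDirectory("index-shared");
        buildLocally(local, Arrays.asList("doc one", "doc two"));
        System.out.println("copied " + publish(local, shared) + " files");
    }
}
```

The point of the split is that all the small, random writes happen on
the local disk, and the shared store only sees whole finished files.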

Hope this clarifies things.


On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff <ian.soboroff@nist.gov> wrote:
> I understand why you would index in the reduce phase, because the anchor
> text gets shuffled to be next to the document.  However, when you index
> in the map phase, don't you just have to reindex later?
> The main point for the OP is that HDFS is a bad FS for writing Lucene
> indexes because of how Lucene works.  The simple approach is to write
> your index outside of HDFS in the reduce phase, and then merge the
> indexes from each reducer manually.
> Ian
> Ning Li <ning.li.00@gmail.com> writes:
>> Or you can check out the index contrib. The difference between the two is:
>>   - In Nutch's indexing map/reduce job, indexes are built in the
>> reduce phase. Afterwards, they are merged into a smaller number of
>> shards if necessary. The last time I checked, the merge process does
>> not use map/reduce.
>>   - In contrib/index, small indexes are built in the map phase. They
>> are merged into the desired number of shards in the reduce phase. In
>> addition, they can be merged into existing shards.
>> Cheers,
>> Ning
>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 <imcaptor@126.com> wrote:
>>> You can look at the Nutch code.
>>> 2009/3/13 Mark Kerzner <markkerzner@gmail.com>
>>>> Hi,
>>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>> Thank you,
>>>> Mark
