hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian Soboroff <ian.sobor...@nist.gov>
Subject Re: Creating Lucene index in Hadoop
Date Mon, 16 Mar 2009 20:52:32 GMT

Does anyone have stats on how multiple readers on an optimized Lucene
index in HDFS compares with a ParallelMultiReader (or whatever its
called) over RPC on a local filesystem?

I'm missing why you would ever want the Lucene index in HDFS for


Ning Li <ning.li.00@gmail.com> writes:

> I should have pointed out that Nutch index build and contrib/index
> targets different applications. The latter is for applications who
> simply want to build Lucene index from a set of documents - e.g. no
> link analysis.
> As to writing Lucene indexes, both work the same way - write the final
> results to local file system and then copy to HDFS. In contrib/index,
> the intermediate results are in memory and not written to HDFS.
> Hope it clarifies things.
> Cheers,
> Ning
> On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff <ian.soboroff@nist.gov> wrote:
>> I understand why you would index in the reduce phase, because the anchor
>> text gets shuffled to be next to the document.  However, when you index
>> in the map phase, don't you just have to reindex later?
>> The main point to the OP is that HDFS is a bad FS for writing Lucene
>> indexes because of how Lucene works.  The simple approach is to write
>> your index outside of HDFS in the reduce phase, and then merge the
>> indexes from each reducer manually.
>> Ian
>> Ning Li <ning.li.00@gmail.com> writes:
>>> Or you can check out the index contrib. The difference of the two is that:
>>>   - In Nutch's indexing map/reduce job, indexes are built in the
>>> reduce phase. Afterwards, they are merged into smaller number of
>>> shards if necessary. The last time I checked, the merge process does
>>> not use map/reduce.
>>>   - In contrib/index, small indexes are built in the map phase. They
>>> are merged into the desired number of shards in the reduce phase. In
>>> addition, they can be merged into existing shards.
>>> Cheers,
>>> Ning
>>> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 <imcaptor@126.com> wrote:
>>>> you can see the nutch code.
>>>> 2009/3/13 Mark Kerzner <markkerzner@gmail.com>
>>>>> Hi,
>>>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>>> Thank you,
>>>>> Mark

View raw message