hadoop-common-user mailing list archives

From Frank Scholten <fr...@frankscholten.nl>
Subject Distributed Indexing on MapReduce
Date Thu, 01 Mar 2012 10:08:13 GMT
Hi all,

I am looking into reusing some existing code for distributed indexing
to test a Mahout tool I am working on:
https://issues.apache.org/jira/browse/MAHOUT-944

What I want is to index the Apache Public Mail Archives dataset (200
GB) via MapReduce on Hadoop.

I have been going through the Nutch and contrib/index code, and from
my understanding I have to:

* Create an InputFormat / RecordReader / InputSplit for splitting the
e-mails across mappers
* Create a Mapper which emits the e-mails as key-value pairs
* Create a Reducer which indexes the e-mails on the local filesystem
(or straight to HDFS?)
* Copy these indexes from the local filesystem to HDFS. In the same Reducer?
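For the first step, since the mail archives are mbox files, the core of
the RecordReader is finding message boundaries: a message starts at a
line beginning with "From " (the mbox envelope line). Here is a minimal
sketch of that boundary logic in plain Java; the class and method names
are my own, and a real RecordReader would stream over an InputSplit and
seek forward to the next envelope line rather than hold the file in
memory:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing the split logic an mbox RecordReader
// would need. Each record (e-mail) begins at a line starting with
// "From " -- the mbox envelope line -- so records can only be cut
// at those lines.
public class MboxSplitter {

    // Split raw mbox text into individual messages. A real reader
    // would do this incrementally per InputSplit instead of on a
    // whole in-memory String.
    public static List<String> splitMessages(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : mbox.split("\n", -1)) {
            if (line.startsWith("From ")) {   // envelope line: new message
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append("\n");
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    public static void main(String[] args) {
        String mbox =
            "From alice@example.org Thu Mar  1 10:08:13 2012\n" +
            "Subject: first\n\nbody one\n" +
            "From bob@example.org Thu Mar  1 10:09:00 2012\n" +
            "Subject: second\n\nbody two\n";
        List<String> msgs = splitMessages(mbox);
        System.out.println(msgs.size());                  // 2
        System.out.println(msgs.get(0).contains("first")); // true
    }
}
```

The Mapper would then emit each message, e.g. keyed by message-id or
archive path, and the partitioner decides which reducer (and thus
which shard) it lands in.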

I am unsure about the final steps: how do I get to the end result, a
set of index shards on HDFS? It seems that each reducer needs to know
which HDFS directory it will eventually write to, but I don't see how
to get each reducer to copy its shard there.
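One possible answer (a sketch of my own, not something I've confirmed
in contrib/index): each reducer already has a unique partition number,
which it can read from its task id (or, in the old API, the
"mapred.task.partition" configuration property). If every reducer
derives its shard directory from that number, no coordination is
needed, and in close()/cleanup() it can copy its local index up with
something like FileSystem.copyFromLocalFile(). The naming scheme could
mirror Hadoop's own part-file convention:

```java
// Hypothetical naming scheme: each reducer owns exactly one shard
// directory under the job output path, derived from its partition
// number, so reducers never collide and need no coordination.
public class ShardPath {

    // Mirrors Hadoop's part-file convention (part-00000, part-00001, ...)
    public static String shardDir(String outputPath, int partition) {
        return String.format("%s/shard-%05d", outputPath, partition);
    }

    public static void main(String[] args) {
        System.out.println(shardDir("hdfs://nn/indexes", 3));
        // hdfs://nn/indexes/shard-00003
    }
}
```

The reducer would then index into a local temp directory during
reduce() and, once the task finishes, upload that directory to
shardDir(jobOutput, partition) on HDFS in one copy.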

How do I set this up?

Cheers,

Frank
