hadoop-mapreduce-user mailing list archives

From Chris Douglas <cdoug...@apache.org>
Subject Re: Best practices for jobs with large Map output
Date Tue, 19 Apr 2011 18:47:36 GMT
On Mon, Apr 18, 2011 at 3:42 AM, Shai Erera <serera@gmail.com> wrote:
> I ended up doing the following -- my HDFS Mapper creates an index in-memory
> and then serializes the in-memory index into a single file that is stored on
> HDFS (each Mapper serializes to a different file). I use the FileSystem API
> to achieve that, so hopefully it's the way to do it. The Mapper outputs a
> Text value which is the location on HDFS. The Reducer then interprets that
> value, reads the file using the FileSystem API, and deserializes it into an
> in-memory Lucene index.

Without knowing the format of a Lucene index, I can't say whether this
approach makes sense. Instead of handling the cleanup yourself, you
might consider running the index generation and the concat as separate
parts of your workflow (as Harsh suggested). -C
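
For reference, the handoff described in the quoted message can be sketched
roughly as below. This is only an illustration under stated assumptions, not
Shai's actual code: it uses java.nio.file on a local directory as a stand-in
for the Hadoop FileSystem API so that it runs without a cluster, and the class
name, method names, and "index-<taskId>.bin" naming scheme are all
hypothetical. The payload is an opaque byte array rather than a real Lucene
index.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the local filesystem stands in for HDFS.
public class SideFileHandoff {

    // "Map" side: write one task's serialized payload to its own file and
    // return the location string that would be emitted as the map output value.
    public static String writeSideFile(Path sharedDir, String taskId, byte[] payload)
            throws IOException {
        Files.createDirectories(sharedDir);
        // One file per task, so concurrent tasks never write to the same path.
        Path file = sharedDir.resolve("index-" + taskId + ".bin");
        Files.write(file, payload);
        return file.toString();
    }

    // "Reduce" side: resolve each emitted location, load the payload, and
    // delete the side file once it has been read.
    public static List<byte[]> readSideFiles(Iterable<String> locations)
            throws IOException {
        List<byte[]> payloads = new ArrayList<>();
        for (String location : locations) {
            Path file = Paths.get(location);
            payloads.add(Files.readAllBytes(file));
            // Manual cleanup: this is the part the job has to own itself.
            Files.delete(file);
        }
        return payloads;
    }
}
```

Note that the delete in readSideFiles is the cleanup the job takes on by
itself: if the reading task fails partway through and is retried, the files it
already deleted are gone. Splitting the work into two jobs, with the first
job's output directory as the second job's input, avoids that bookkeeping.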
