hadoop-common-user mailing list archives

From Trevor Harmon <tre...@vocaro.com>
Subject Is it wrong to bypass HDFS?
Date Sun, 09 Nov 2014 17:16:20 GMT

I’m trying to model an "embarrassingly parallel" problem as a map-reduce job. The amount
of data is small -- about 100MB per job, and about 0.25MB per work item -- but the reduce
phase is very CPU-intensive, requiring about 30 seconds to reduce each mapper's output to
a single value. The goal is to speed up the computation by distributing the tasks across many nodes.

I am not sure how the mappers would work in this scenario. My initial thought was that there
would be one mapper per reducer, and each mapper would fetch its input directly from the source
database, using an input key provided by Hadoop. (Remember it’s only about 0.25MB per work
item.) It would then do some necessary fix-up and massaging of the data to prepare it for
the reduction phase.
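To make the idea concrete, here is a toy sketch in plain Python (no Hadoop involved) of the design described above: the job input is only a list of record keys, each "mapper" fetches its small work item directly from the source database by key and massages it, and the "reducer" collapses each item to a single value. The names `SOURCE_DB`, `fixup`, and the sum-of-squares reduction are all illustrative stand-ins, not anything from a real system.

```python
# Toy simulation of the proposed design: mappers receive only a key
# from the framework and fetch the actual data themselves, bypassing
# any intermediate copy into a distributed filesystem.

SOURCE_DB = {                      # stand-in for the external source database
    "item-1": [3, 1, 4, 1, 5],
    "item-2": [9, 2, 6, 5, 3],
}

def mapper(key):
    """Fetch the work item directly by key, then fix up the data."""
    raw = SOURCE_DB[key]           # direct fetch; no HDFS staging step
    fixed = sorted(raw)            # placeholder for the fix-up/massaging
    return key, fixed

def reducer(key, values):
    """Stand-in for the CPU-intensive step that yields a single value."""
    return key, sum(v * v for v in values)

def run_job(keys):
    """Run every mapper, then reduce each mapper's output."""
    return dict(reducer(k, v) for k, v in map(mapper, keys))

print(run_job(["item-1", "item-2"]))
# → {'item-1': 52, 'item-2': 155}
```

In real Hadoop terms this would correspond to a custom InputFormat that emits only keys (or something like DBInputFormat), with the map task doing the external read itself.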

However, none of the tutorials and example code I’ve seen do it this way. They always copy
the data from the source database to HDFS first. For my use case, this seems wasteful. The
data per task is very small and can fit entirely in the mapper’s and reducer’s main memory,
so I don’t need “big data” redundant storage. Also, the data is read only once per task,
so there’s nothing to be gained by the data locality optimizations of HDFS. Having to copy
the data to an intermediate data store seems unnecessary and just adds overhead in this case.

Is it okay to bypass HDFS for certain types of problems, such as this one? Or is there some
reason mappers should never perform external I/O? I am very new to Hadoop so I don’t have
much experience to go on here. Thank you,

