hadoop-common-user mailing list archives

From: Todd Lipcon <t...@cloudera.com>
Subject: Re: Optimal setup for a test problem
Date: Mon, 12 Apr 2010 17:53:56 GMT
Hi Andrew,

Do you need the sorting behavior that an identity reducer gives you? If
not, set the number of reduce tasks to 0 and you'll end up with a map-only
job, which should be significantly faster.
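With streaming that's just a matter of passing -D mapred.reduce.tasks=0 as
a generic option. A rough sketch -- the jar path, input/output paths, and
convert.py are made-up names for illustration:

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -input /user/andrew/samples \
    -output /user/andrew/samples-converted \
    -mapper convert.py \
    -file convert.py

With zero reducers, each map task writes its output straight to HDFS as
part-* files, so the shuffle and sort phases are skipped entirely.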


On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen <andrew-lists-hadoop@ucsfcti.org> wrote:

> Hello,
> I recently set up a 5-node cluster (1 master, 4 slaves) and am looking to
> use it to process high volumes of patient physiologic data.  As an initial
> exercise to gain a better understanding, I have attempted to run the
> following problem (which, as I understand it, isn't really the type of
> problem Hadoop was designed for).
> I have a 6 GB data file that contains key/value pairs of <sample number,
> sample value>.  I'd like to convert the values to their physical units
> using a gain/offset.  I've set up a MapReduce job using streaming where
> the mapper does the conversion and the reducer is just an identity
> reducer.  Based on other threads on the mailing list, my initial results
> are consistent with those reports: it takes considerably more time to
> process this in Hadoop than it does on my MacBook Pro (45 minutes vs. 13
> minutes).  The input is a single 6 GB file, and it looks like the file is
> being split into 101 map tasks, which is consistent with the 64 MB block
> size.
> So my questions are:
> * Would it help to increase the block size to 128 MB?  Or to decrease the
> block size?  What are the key factors to consider with this question?
> * Are there any other optimizations I could employ?  I have looked into
> LzoCompression, but I'd like to keep working without compression since the
> single-threaded job I'm comparing against doesn't use any compression.  I
> know I'm comparing apples to pears a little here, so please feel free to
> correct this assumption.
> * Is Hadoop really only good for jobs where the data doesn't fit on a
> single node?  At some level, I assume it can still speed up jobs that do
> fit on one node, if only because you are performing tasks in parallel.
> Thanks!
> --Andrew
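For reference, the conversion mapper described above only needs a few lines
of Python. A minimal sketch, assuming tab-separated <sample number, sample
value> lines on stdin and made-up GAIN/OFFSET calibration constants:

  #!/usr/bin/env python
  # Streaming mapper: convert raw sample values to physical units.
  # GAIN and OFFSET are placeholders for the real calibration values.
  import sys

  GAIN = 0.0025
  OFFSET = -2.5

  for line in sys.stdin:
      key, value = line.rstrip("\n").split("\t", 1)
      # Emit <sample number, physical value>, still tab-separated.
      sys.stdout.write("%s\t%f\n" % (key, float(value) * GAIN + OFFSET))

Streaming hands each input line to the script on stdin and treats each
tab-separated line on stdout as a key/value pair.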

Todd Lipcon
Software Engineer, Cloudera
