hadoop-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Scott Carey <sc...@richrelevance.com>
Subject Re: hadoop scales but is not performant?
Date Tue, 15 Sep 2009 23:59:45 GMT
Some more comments:

How are you controlling your graph of different map reduce dependencies?  Doing this linearly
will leave the cluster rather underutilized.  Using Pig or Cascading or the job control interface
to schedule this all (which can submit parallel independent jobs), combined with the Fair
Scheduler, will usually have very good results on smaller clusters in reducing overall batch
latency and increasing cluster utilization.  This often allows setting the number of reduces
for each job more optimally as well.

Although Pig will be a bit less performant than an optimized M/R, for tasks like yours it
can possibly go much faster with a lot less code by optimizing your M/R passes, scheduling,
and providing some alternate join strategies out of the box if used right.

On 9/15/09 4:02 PM, "Scott Carey" <scott@richrelevance.com> wrote:

Its not entirely invalid at this scale, though the dataset generally should be (in memory)
closer to the size of the available RAM of the cluster or more to be a better test.

Hadoop should still do well at this size relative to a single process.  If it is taking 45
minutes on a 10 machine cluster, it should be faster than any single process.  If this was
limited primarily by startup costs, then increasing the data size wouldn't almost linearly
increase the time.

It can probably be significantly tuned.
Some tips for jobs this size:
Watch the CPU / RAM / IO on a node while it is running.  Is it being well utilized?
Watch the jobs in the HTML UI on the jobtracker.  Are the total number of slots configured
being used most of the time?   Or is it spending a lot of clock time with a lot of empty slots?
 If so, which slots?  Are Maps or Reduces taking the most time?

Hadoop currently is a slow scheduler for tasks (can schecule at most 1 task per second or
two per node).  Using the Fair Scheduler and enabling some options can turn this up to 1 map
and 1 reduce to schedule per 'tick'.  There are some big changes in the schedulers due out
in 0.21 that will significantly help here.  For smaller low latency jobs this can make a big

Hadoop also has some flaws in the shuffle phase that affect clusters of all sizes, but can
hurt small clusters with many small map jobs.   Look at the log file output of your reduce
jobs (in the jobtracker UI) and see how long the shuffle phase is taking.
There is a big change due for 0.21 that makes this a LOT faster for some cases, and low latency
smaller jobs will benefit a lot too.

See my comment in that ticket from the 10th of June in relation to shuffle times.  A one line
change in 0.19.x and 0.20.x cut times on smaller low latency jobs on smaller clusters in half
when limited by shuffle inefficiencies (specifically, fetching no more than one map output
from one host every few seconds for no good reason).

Tuning several hadoop parameters might help a great deal as well.   Cloudera's configuration
wizard is a good starting point to help you find which knobs are more important.

You probably also want to do some work to figure out what is taking time on your local tests.
 There might be something wrong there.  It seems suspicious to me, even at the one document
size, that it would take 2 + minutes versus less than one second.  Startup costs aren't that
big for 2 maps and 2 reduces - on the order of a few seconds not minutes.  A few seconds times
a few jobs is still not that much.

On 9/14/09 6:14 PM, "Mat Kelcey" <matthew.kelcey@gmail.com> wrote:

hi all,

recently i've started playing with hadoop and my first learning
experiment has surprised me

i'm implementing a a text problem; trying to extract infrequent
phrases using a mixture of probabilistic models.
it's a bit of toy problem so the algorithm details aren't super
important, though the implementation might be....

this problem ended up being represented by a dozen or so map reduce
jobs of various types including some aggregate steps and some manual
(see the bottom of this page for details

i've implemented each step using ruby / streaming, the code is at
http://github.com/matpalm/sip if anyone cares.

i ran some tests using 10 ec2 medium cpu instance across a small
100mb's of gzipped text.
for validation of the results i also reimplemented the entire
algorithm as a single threaded ruby app

my surprise comes from finding that the ruby implementation
outperforms the 10 ec2 instances on this data size...
i ran a few samples of different sizes with the graph at the bottom of

so why is this? here are my explanations in order of how confident i am...

a) 100mb is peanuts and hadoop was made for 1000x this size so the
test is invalid.
b) there is a better representation of this problem that uses fewer
map/reduce passes.
c) streaming is too slow and rewriting in java (and making use of
techniques like chaining mappers) would speed things up
d) doing these steps, particularly the joins, in pig would be faster

my next steps are to rewrite some of the steps in pig to sample the difference

does anyone have any high level comments on this?


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message