From: Scott Carey
To: general@hadoop.apache.org
Date: Tue, 15 Sep 2009 16:59:45 -0700
Subject: Re: hadoop scales but is not performant?
Some more comments:

How are you controlling your graph of different map-reduce dependencies? Doing this linearly will leave the cluster rather underutilized. Using Pig, Cascading, or the job control interface to schedule this all (any of which can submit independent jobs in parallel), combined with the Fair Scheduler, will usually have very good results on smaller clusters in reducing overall batch latency and increasing cluster utilization. This often allows setting the number of reduces for each job more optimally as well.

Although Pig will be a bit less performant than an optimized M/R job, for tasks like yours it can possibly go much faster with a lot less code, by optimizing your M/R passes and scheduling, and by providing some alternate join strategies out of the box if used right.

On 9/15/09 4:02 PM, "Scott Carey" wrote:

It's not entirely invalid at this scale, though to be a better test the dataset should generally be closer in size to (or larger than) the total RAM available in the cluster.

Hadoop should still do well at this size relative to a single process. If it is taking 45 minutes on a 10-machine cluster, it should be faster than any single process. If it were limited primarily by startup costs, then increasing the data size wouldn't increase the time almost linearly.

It can probably be significantly tuned. Some tips for jobs this size:

Watch the CPU / RAM / IO on a node while the job is running. Is it being well utilized? Watch the jobs in the HTML UI on the jobtracker.
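To illustrate the dependency-graph point above: independent jobs at the same level of the DAG can be submitted concurrently (as JobControl, Pig, or Cascading would do) instead of strictly one after another. A minimal in-process sketch in plain Ruby, with hypothetical job names standing in for real M/R submissions:

```ruby
# Sketch: run a small job DAG level by level, launching all jobs whose
# prerequisites are done at the same time, rather than strictly serially.
# Job names are hypothetical; Thread.new stands in for submitting a job
# and waiting for it to complete.
deps = {
  'phrase_count' => [],
  'doc_freq'     => [],
  'join_stats'   => ['phrase_count', 'doc_freq'],
}

done = []
until done.size == deps.size
  # jobs not yet run whose prerequisites have all finished
  runnable = deps.keys.select { |j| !done.include?(j) && (deps[j] - done).empty? }
  threads = runnable.map do |job|
    Thread.new { sleep 0.01 } # stand-in for a running M/R job
  end
  threads.each(&:join)
  done.concat(runnable)
end

puts done.inspect  # join_stats runs only after both of its inputs
```

Here 'phrase_count' and 'doc_freq' run in the same wave, halving the wall-clock depth of this toy graph; a serial driver script would run all three back to back.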
Are the total number of configured slots being used most of the time? Or is the cluster spending a lot of clock time with a lot of empty slots? If so, which slots? Are maps or reduces taking the most time?

Hadoop is currently a slow scheduler for tasks (it can schedule at most one task per second or two, per node). Using the Fair Scheduler and enabling some options can turn this up to one map and one reduce scheduled per 'tick'. There are some big changes in the schedulers due out in 0.21 that will significantly help here. For smaller, low-latency jobs this can make a big difference.

Hadoop also has some flaws in the shuffle phase that affect clusters of all sizes, but they can especially hurt small clusters with many small map tasks. Look at the log output of your reduce tasks (in the jobtracker UI) and see how long the shuffle phase is taking.

There is a big change due in 0.21 that makes this a LOT faster for some cases, and low-latency smaller jobs will benefit a lot too: https://issues.apache.org/jira/browse/MAPREDUCE-318

See my comment in that ticket from the 10th of June in relation to shuffle times. A one-line change in 0.19.x and 0.20.x cut times in half on smaller low-latency jobs on smaller clusters when limited by shuffle inefficiencies (specifically, fetching no more than one map output from a given host every few seconds, for no good reason).

Tuning several Hadoop parameters might help a great deal as well. Cloudera's configuration wizard is a good starting point to help you find which knobs are more important.

You probably also want to do some work to figure out what is taking time in your local tests. There might be something wrong there. It seems suspicious to me, even at the one-document size, that it would take 2+ minutes versus less than one second. Startup costs aren't that big for 2 maps and 2 reduces: on the order of a few seconds, not minutes. A few seconds times a few jobs is still not that much.
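For reference, the Fair Scheduler options mentioned above are enabled via mapred-site.xml in 0.19/0.20-era Hadoop. A sketch, assuming the fair scheduler contrib jar is on the jobtracker classpath and treating the allocation-file path as a placeholder (mapred.fairscheduler.assignmultiple is the option that lets a map and a reduce be assigned on the same heartbeat):

```xml
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <!-- assign both a map and a reduce slot per scheduling 'tick' -->
  <name>mapred.fairscheduler.assignmultiple</name>
  <value>true</value>
</property>
<property>
  <!-- path is a placeholder; point at your pools/allocations file -->
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>
```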
On 9/14/09 6:14 PM, "Mat Kelcey" wrote:

hi all,

recently i've started playing with hadoop and my first learning experiment has surprised me

i'm implementing a text problem; trying to extract infrequent phrases using a mixture of probabilistic models. it's a bit of a toy problem so the algorithm details aren't super important, though the implementation might be....

this problem ended up being represented by a dozen or so map reduce jobs of various types, including some aggregate steps and some manual joins. (see the bottom of this page for details http://matpalm.com/sip/take3_markov_chains.html)

i've implemented each step using ruby / streaming, the code is at http://github.com/matpalm/sip if anyone cares.

i ran some tests using 10 ec2 medium-cpu instances across a small 100mb of gzipped text. for validation of the results i also reimplemented the entire algorithm as a single-threaded ruby app.

my surprise comes from finding that the ruby implementation outperforms the 10 ec2 instances at this data size... i ran a few samples of different sizes, with the graph at the bottom of http://matpalm.com/sip/part4_but_does_it_scale.html

so why is this? here are my explanations, in order of how confident i am...

a) 100mb is peanuts and hadoop was made for 1000x this size, so the test is invalid.
b) there is a better representation of this problem that uses fewer map/reduce passes.
c) streaming is too slow and rewriting in java (and making use of techniques like chaining mappers) would speed things up.
d) doing these steps, particularly the joins, in pig would be faster.

my next steps are to rewrite some of the steps in pig to sample the difference

does anyone have any high level comments on this?

cheers,
mat
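On option (c): the overhead of streaming comes from serializing every key/value pair as text over stdin/stdout between Hadoop and the Ruby process, plus Hadoop's sort between the phases. The shape of a streaming step can be exercised locally without Hadoop at all; a minimal word-count sketch (not Mat's actual code) that wires up the same map / sort-group / reduce pipeline in-process:

```ruby
# Minimal streaming-style word count, runnable without Hadoop.
# Hadoop Streaming would run map and reduce as separate processes,
# with its own sort/shuffle in between; here the same three stages
# are chained in one process for illustration.

def map_phase(lines)
  # emit one (word, 1) pair per token, as a streaming mapper would
  lines.flat_map { |l| l.downcase.scan(/[a-z']+/).map { |w| [w, 1] } }
end

def reduce_phase(pairs)
  # pairs arrive grouped by key, as they would after Hadoop's sort
  pairs.sort_by(&:first)
       .chunk_while { |a, b| a.first == b.first }
       .map { |grp| [grp.first.first, grp.sum(&:last)] }
end

counts = reduce_phase(map_phase(["to be or not to be"]))
puts counts.inspect  # => [["be", 2], ["not", 1], ["or", 1], ["to", 2]]
```

Timing map_phase and reduce_phase alone against the full streaming job is one way to separate per-record serialization cost from Hadoop's per-job startup and shuffle cost.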