hadoop-general mailing list archives

From: Mat Kelcey <matthew.kel...@gmail.com>
Subject: Re: hadoop scales but is not performant?
Date: Wed, 16 Sep 2009 03:37:55 GMT

thanks scott, some great things to think about!

the only "tuning" i did was to set mapred.reduce.tasks and
mapred.map.tasks to 30 to correspond to the capability specified by
the html ui. i admit i did this without a deep understanding of what it
meant; i do know that when i did not specify these, only a few
mappers would be utilised (due to the same input data size)
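
for reference, a minimal sketch of how those two settings might be passed to a single streaming job. the jar name, paths and mapper/reducer scripts in the usage line are placeholders, not the ones from my runs, and the function only builds the argument list without running anything. i'm assuming the generic -D option here; older releases may need -jobconf instead. as far as i understand it, mapred.map.tasks is only a hint (the real number of maps follows the input splits), whereas mapred.reduce.tasks is taken as given.

```python
def streaming_command(input_path, output_path, mapper, reducer,
                      num_maps=30, num_reduces=30,
                      streaming_jar="hadoop-streaming.jar"):
    """Build the argument list for one streaming job (nothing is executed).

    streaming_jar is a placeholder; the real jar name and location
    depend on the hadoop install.
    """
    return [
        "hadoop", "jar", streaming_jar,
        # generic -D options go before the streaming-specific options
        "-D", f"mapred.map.tasks={num_maps}",        # a hint only
        "-D", f"mapred.reduce.tasks={num_reduces}",  # used as given
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


# hypothetical paths and script names, for illustration only
cmd = streaming_command("docs/in", "docs/out", "map.rb", "reduce.rb")
print(" ".join(cmd))
```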

in relation to scheduling i was taking the simple approach of running
the streaming jobs sequentially with the default scheduler. even from
watching output scroll past it is obvious that a _lot_ of time is
being taken up in setup related activities. this is most apparent in
the single document case. something is just not right...

i had read in http://issues.apache.org/jira/browse/HADOOP-2721 "Use
job control for tasks (and therefore for pipes and streaming)" that
jobcontrol (specifically representing job dependencies) was not yet
available for streaming. as such i dismissed any scheduling changes.
i'll revisit this to make sure i understand what i can and can't do in
streaming. if nothing else i can try the fair scheduler with my own rolled
version of dependencies. i'm orchestrating the job runs from rake and
i've got my own homebrew libraries for this type of dependency
management, though i'm also loath to roll my own versions of things.
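
to make the "own rolled version of dependencies" idea concrete, here is a rough sketch of the kind of ordering i have in mind. it is in python rather than my actual rake code, and the job names are made up. it groups jobs into waves where every job in a wave only depends on jobs from earlier waves, so the jobs within one wave could in principle be submitted together and left to the scheduler to share slots between them.

```python
def execution_waves(deps):
    """Group jobs into waves that respect dependencies.

    deps maps a job name to the set of job names it depends on.
    Every job in a wave has all of its dependencies in earlier waves.
    """
    remaining = {job: set(needs) for job, needs in deps.items()}
    # jobs that only ever appear as a dependency have no dependencies themselves
    for needs in deps.values():
        for job in needs:
            remaining.setdefault(job, set())

    done = set()
    waves = []
    while remaining:
        ready = sorted(job for job, needs in remaining.items() if needs <= done)
        if not ready:
            raise ValueError("dependency cycle among: " + ", ".join(sorted(remaining)))
        waves.append(ready)
        done.update(ready)
        for job in ready:
            del remaining[job]
    return waves


# hypothetical job names, for illustration only
jobs = {
    "tokenise": set(),
    "term_freq": {"tokenise"},
    "doc_freq": {"tokenise"},
    "tf_idf": {"term_freq", "doc_freq"},
}
for wave in execution_waves(jobs):
    print(wave)
```

running the waves one after another, and the jobs inside a wave sequentially, gives the same behaviour as my current setup; submitting a whole wave at once is where a scheduler change might help.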

so lots of ideas and things to check, i'll rerun trying some of the
things you've mentioned.

thanks again for the feedback!
