hadoop-common-user mailing list archives

From Stephen Boesch <java...@gmail.com>
Subject Re: Estimating Time required to compute M/Rjob
Date Sat, 16 Apr 2011 20:19:07 GMT
Some additional thoughts about the 'variables' involved in
characterizing the M/R application itself.

   - the configuration of the cluster for numbers of mappers vs. reducers,
   compared to the characteristics (amount of work/processing) required in each
   of the map/shuffle/reduce stages

   - is the application using multiple chained M/R stages?  Multi-stage
   M/R jobs are more difficult to tune properly in terms of keeping all workers
   busy. That may be challenging to model.
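
A rough sketch of how the two variables above might enter an estimate,
assuming tasks run in "waves" limited by the number of slots, that every
task in a stage takes about the same time, and that map, shuffle and reduce
do not overlap (in a real cluster the shuffle usually overlaps the map
phase, so this leans pessimistic). All function names and dictionary keys
here are hypothetical, not part of any Hadoop API:

```python
import math


def estimate_phase_seconds(num_tasks, slots, seconds_per_task):
    """Time for one phase when at most `slots` tasks run concurrently.

    Assumption: uniform task duration, so the phase takes
    ceil(num_tasks / slots) waves of `seconds_per_task` each.
    """
    if num_tasks == 0:
        return 0.0
    if slots <= 0:
        raise ValueError("slots must be positive")
    waves = math.ceil(num_tasks / slots)
    return waves * seconds_per_task


def estimate_job_seconds(stages):
    """Sum the estimates for a chain of M/R stages run one after another.

    Each stage is a dict with hypothetical keys: maps, map_slots,
    map_task_s, shuffle_s (optional), reduces, reduce_slots, reduce_task_s.
    """
    total = 0.0
    for s in stages:
        total += estimate_phase_seconds(s["maps"], s["map_slots"], s["map_task_s"])
        total += s.get("shuffle_s", 0.0)
        total += estimate_phase_seconds(
            s["reduces"], s["reduce_slots"], s["reduce_task_s"]
        )
    return total
```

A chained job is just a longer list of stages. The sketch also shows why
multi-stage jobs are hard to keep busy: a stage whose task count is not a
multiple of the slot count leaves slots idle in its last wave.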

2011/4/16 Stephen Boesch <javadba@gmail.com>

> You could consider two scenarios / sets of requirements for your estimator:
>    1. Allow it to 'learn' from certain input data and then project running
>    times of similar (or moderately dissimilar) workloads. So the first step
>    could be to define a couple of relatively small "control" M/R jobs on a
>    small-ish dataset and throw them at the unknown (cluster-under-test) HDFS / M/R
>    cluster. Try to design the "control" M/R jobs in a way that they will be
>    able to completely load down all of the available DataNodes in the
>    cluster-under-test for at least a brief period of time. Then you will
>    have obtained a decent signal on the capabilities of the cluster under test,
>    which may allow a relatively high degree of predictive accuracy for even much
>    larger jobs.
>    2. If instead it were your goal to drive the predictions off of a
>    purely mathematical model - in your terms the "application" and "base file
>    system" - and without any empirical data - then here is an alternative
>    approach.
>       - Follow step (1) above against a variety of "applications" and
>       "base file systems" - especially in configurations for which you wish your
>       model to provide high quality predictions.
>       - Save the results in structured data.
>       - Derive formulas for characterizing the curves of performance via
>       those variables that you defined (application / base file system).
> Now you have a trained model.  When it is applied to a new set of
> applications / base file systems it can use the curves you have already
> determined to provide the result without any runtime requirements.
> Obviously the value of this second approach is limited by the degree of
> similarity of the training data to the applications you attempt to model.
> If all of your training data is from a 50-node cluster of machines with
> IDE drives, don't expect good results when asked to model a 1000-node cluster
> using SAN / RAID / SCSI storage.
> 2011/4/16 Sonal Goyal <sonalgoyal4@gmail.com>
>> What is your MR job doing? What is the amount of data it is processing?
>> What kind of a cluster do you have? Would you be able to share some details
>> about what you are trying to do?
>> If you are looking for metrics, you could look at the Terasort run.
>> Thanks and Regards,
>> Sonal
>> Hadoop ETL and Data Integration <https://github.com/sonalgoyal/hiho>
>> Nube Technologies <http://www.nubetech.co>
>> <http://in.linkedin.com/in/sonalgoyal>
>> On Sat, Apr 16, 2011 at 3:31 PM, real great..
>> <greatness.hardness@gmail.com> wrote:
>> > Hi,
>> > As part of my final-year BE project, I want to estimate the time
>> > required by an M/R job given an application and a base file system.
>> > Can you folks please help me by posting some thoughts on this issue or
>> > posting some links here?
>> >
>> > --
>> > Regards,
>> > R.V.
>> >
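
The "derive formulas from saved results" step in the second approach quoted
above could be sketched as an ordinary least-squares line fit of runtime
against input size. This is only a minimal illustration: it assumes runtime
grows roughly linearly with input size on a fixed cluster, the function
names are hypothetical, and the numbers in the usage note are placeholders
rather than measurements:

```python
def fit_line(sizes, seconds):
    """Least-squares fit of seconds = intercept + slope * size.

    `sizes` and `seconds` are parallel lists of control-job results
    (for example, input size in GB and wall-clock runtime).
    """
    n = len(sizes)
    if n < 2 or n != len(seconds):
        raise ValueError("need at least two matching (size, seconds) pairs")
    mean_x = sum(sizes) / n
    mean_y = sum(seconds) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    if sxx == 0:
        raise ValueError("control runs must use at least two distinct sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, seconds))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_seconds(model, size):
    """Project the runtime of a job with the given input size."""
    intercept, slope = model
    return intercept + slope * size


# Placeholder control runs (not real measurements): sizes in GB, runtimes in s.
model = fit_line([1, 2, 4, 8], [25, 40, 70, 130])
projected = predict_seconds(model, 100)
```

The intercept roughly captures fixed job overhead (startup, scheduling) and
the slope the per-GB cost on that particular cluster. As noted above, such a
fit is only as good as the similarity between the control runs and the job
being predicted, so extrapolating far beyond the measured sizes or to
different hardware should be treated with suspicion.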
