Subject: Re: Estimating Time required to compute M/R job
From: Stephen Boesch <javadba@gmail.com>
To: common-user@hadoop.apache.org
Date: Sat, 16 Apr 2011 13:08:59 -0700

You could consider two scenarios / sets of requirements for your estimator:

1. Allow the estimator to 'learn' from certain input data and then project
the running times of similar (or moderately dissimilar) workloads. The first
step could be to define a couple of relatively small "control" M/R jobs over
a small-ish dataset and throw them at the unknown HDFS / M/R cluster (the
cluster under test). Try to design the "control" jobs so that they completely
load down all of the available DataNodes in the cluster under test for at
least a brief period of time. You will then have obtained a decent signal on
the capabilities of the cluster under test, which may allow a relatively high
degree of predictive accuracy even for much larger jobs.
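To make (1) concrete, here is a minimal sketch in plain Java of how you
might extrapolate from a control run. It assumes a crude model - runtime
scales linearly with input size at the observed throughput, plus a fixed
startup overhead - and every class and method name here is made up for
illustration, not part of any Hadoop API:

    public class RuntimeEstimator {

        /** Observed result of one "control" M/R run on the cluster under test. */
        static class ControlRunResult {
            final long inputBytes;    // bytes processed by the control job
            final long runtimeMillis; // wall-clock time of the control job

            ControlRunResult(long inputBytes, long runtimeMillis) {
                this.inputBytes = inputBytes;
                this.runtimeMillis = runtimeMillis;
            }

            /** Effective cluster throughput in bytes per millisecond. */
            double throughput() {
                return (double) inputBytes / runtimeMillis;
            }
        }

        /**
         * Project the runtime of a larger job from the control run, adding a
         * fixed overhead (task scheduling, JVM startup) that does not scale
         * with input size.
         */
        static long estimateRuntimeMillis(ControlRunResult control,
                                          long targetInputBytes,
                                          long fixedOverheadMillis) {
            return fixedOverheadMillis
                    + (long) (targetInputBytes / control.throughput());
        }

        public static void main(String[] args) {
            // Example: control job read 10 GB in 120 s; predict a 500 GB job
            // with an assumed 30 s of fixed startup overhead.
            ControlRunResult control = new ControlRunResult(10L << 30, 120000L);
            long estimate = estimateRuntimeMillis(control, 500L << 30, 30000L);
            System.out.printf("Estimated runtime: %.1f minutes%n",
                    estimate / 60000.0);
        }
    }

Note the fixed-overhead term: for small jobs, task scheduling and JVM
startup can dominate the total time, so a pure throughput extrapolation
would underestimate short runs.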
2. If instead your goal is to drive the predictions off of a purely
mathematical model - in your terms, the "application" and the "base file
system" - without any empirical data at prediction time, then here is an
alternative approach:

- Follow step (1) above against a variety of "applications" and "base file
  systems", especially in configurations for which you wish your model to
  provide high-quality predictions.
- Save the results as structured data.
- Derive formulas characterizing the performance curves in terms of the
  variables you defined (application / base file system).

Now you have a trained model. When it is applied to a new set of
applications / base file systems, it can use the curves you have already
determined to provide the result without any runtime requirements. (A rough
sketch of fitting such a curve appears at the end of this message, below
the quoted thread.)

Obviously the value of this second approach is limited by the degree of
similarity between the training data and the applications you attempt to
model. If all of your training data comes from a 50-node cluster of
machines with IDE drives, don't expect good results when asked to model a
1000-node cluster using SANs, RAID arrays, or SCSI drives.

2011/4/16 Sonal Goyal

> What is your MR job doing? What is the amount of data it is processing?
> What kind of a cluster do you have? Would you be able to share some
> details about what you are trying to do?
>
> If you are looking for metrics, you could look at the Terasort run.
>
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies
>
> On Sat, Apr 16, 2011 at 3:31 PM, real great.. wrote:
>
> > Hi,
> > As a part of my final year BE final project I want to estimate the time
> > required by an M/R job, given an application and a base file system.
> > Can you folks please help me by posting some thoughts on this issue or
> > posting some links here.
> >
> > --
> > Regards,
> > R.V.
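P.S. As promised above, a rough sketch of the curve-fitting idea for
approach (2), again in plain Java. It assumes the simplest possible model -
a straight-line fit of runtime against input size by ordinary least
squares - whereas a real model would also condition on node count, disk
type, and the application itself. All names and numbers below are
illustrative, not from any real benchmark:

    public class FittedRuntimeModel {

        private final double interceptMillis; // fixed overhead term
        private final double millisPerByte;   // marginal cost per input byte

        private FittedRuntimeModel(double interceptMillis, double millisPerByte) {
            this.interceptMillis = interceptMillis;
            this.millisPerByte = millisPerByte;
        }

        /**
         * Ordinary least-squares fit of runtime = a + b * inputBytes over
         * the saved (inputBytes[i], runtimeMillis[i]) benchmark pairs.
         */
        static FittedRuntimeModel fit(double[] inputBytes, double[] runtimeMillis) {
            int n = inputBytes.length;
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                sx += inputBytes[i];
                sy += runtimeMillis[i];
                sxx += inputBytes[i] * inputBytes[i];
                sxy += inputBytes[i] * runtimeMillis[i];
            }
            double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
            double a = (sy - b * sx) / n;
            return new FittedRuntimeModel(a, b);
        }

        /** Predict a new job's runtime with no cluster access at all. */
        double predictMillis(double targetInputBytes) {
            return interceptMillis + millisPerByte * targetInputBytes;
        }

        public static void main(String[] args) {
            // Illustrative saved results from control runs at several sizes.
            double[] bytes  = { 1e9, 5e9, 10e9, 50e9 };
            double[] millis = { 45000, 150000, 280000, 1300000 };
            FittedRuntimeModel model = fit(bytes, millis);
            System.out.printf("Predicted runtime for 200 GB: %.1f minutes%n",
                    model.predictMillis(200e9) / 60000.0);
        }
    }

Once fit, a prediction is just a formula evaluation - which is the whole
appeal of approach (2): no cluster time is needed at estimation time.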