Subject: Re: Estimating Time required to compute M/R job
From: Stephen Boesch <javadba@gmail.com>
To: common-user@hadoop.apache.org
Date: Sat, 16 Apr 2011 13:08:59 -0700

You could consider two scenarios / sets of requirements for your estimator:

1. Allow the estimator to 'learn' from certain input data and then project
the running times of similar (or moderately dissimilar) workloads. The first
step could be to define a couple of relatively small "control" M/R jobs over
a small-ish dataset and throw them at the unknown HDFS / M/R cluster (the
cluster under test). Try to design the "control" jobs so that they completely
load down all of the available DataNodes in the cluster under test for at
least a brief period of time. You will then have obtained a decent signal on
the capabilities of the cluster under test, which may allow a relatively high
degree of predictive accuracy even for much larger jobs.
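To make (1) concrete, here is a minimal sketch in plain Java of how you
might extrapolate from a control run. It assumes a crude model - runtime
scales linearly with input size at the observed throughput, plus a fixed
startup overhead - and every class and method name here is made up for
illustration, not part of any Hadoop API:

    public class RuntimeEstimator {

        /** Observed result of one "control" M/R run on the cluster under test. */
        static class ControlRunResult {
            final long inputBytes;    // bytes processed by the control job
            final long runtimeMillis; // wall-clock time of the control job

            ControlRunResult(long inputBytes, long runtimeMillis) {
                this.inputBytes = inputBytes;
                this.runtimeMillis = runtimeMillis;
            }

            /** Effective cluster throughput in bytes per millisecond. */
            double throughput() {
                return (double) inputBytes / runtimeMillis;
            }
        }

        /**
         * Project the runtime of a larger job from the control run, adding a
         * fixed overhead (task scheduling, JVM startup) that does not scale
         * with input size.
         */
        static long estimateRuntimeMillis(ControlRunResult control,
                                          long targetInputBytes,
                                          long fixedOverheadMillis) {
            return fixedOverheadMillis
                    + (long) (targetInputBytes / control.throughput());
        }

        public static void main(String[] args) {
            // Example: control job read 10 GB in 120 s; predict a 500 GB job
            // with an assumed 30 s of fixed startup overhead.
            ControlRunResult control = new ControlRunResult(10L << 30, 120000L);
            long estimate = estimateRuntimeMillis(control, 500L << 30, 30000L);
            System.out.printf("Estimated runtime: %.1f minutes%n",
                    estimate / 60000.0);
        }
    }

Note the fixed-overhead term: for small jobs, task scheduling and JVM
startup can dominate the total time, so a pure throughput extrapolation
would underestimate short runs.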
2. If instead your goal is to drive the predictions off of a purely
mathematical model - in your terms, the "application" and the "base file
system" - without any empirical data at prediction time, then here is an
alternative approach:

- Follow step (1) above against a variety of "applications" and "base file
  systems", especially in configurations for which you wish your model to
  provide high-quality predictions.
- Save the results as structured data.
- Derive formulas characterizing the performance curves in terms of the
  variables you defined (application / base file system).

Now you have a trained model. When it is applied to a new set of
applications / base file systems, it can use the curves you have already
determined to provide the result without any runtime requirements. (A rough
sketch of fitting such a curve appears at the end of this message, below
the quoted thread.)

Obviously the value of this second approach is limited by the degree of
similarity between the training data and the applications you attempt to
model. If all of your training data comes from a 50-node cluster of
machines with IDE drives, don't expect good results when asked to model a
1000-node cluster using SANs, RAID arrays, or SCSI drives.

2011/4/16 Sonal Goyal

> What is your MR job doing? What is the amount of data it is processing?
> What kind of a cluster do you have? Would you be able to share some
> details about what you are trying to do?
>
> If you are looking for metrics, you could look at the Terasort run.
>
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies
>
> On Sat, Apr 16, 2011 at 3:31 PM, real great.. wrote:
>
> > Hi,
> > As a part of my final year BE final project I want to estimate the time
> > required by an M/R job, given an application and a base file system.
> > Can you folks please help me by posting some thoughts on this issue or
> > posting some links here.
> >
> > --
> > Regards,
> > R.V.
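P.S. As promised above, a rough sketch of the curve-fitting idea for
approach (2), again in plain Java. It assumes the simplest possible model -
a straight-line fit of runtime against input size by ordinary least
squares - whereas a real model would also condition on node count, disk
type, and the application itself. All names and numbers below are
illustrative, not from any real benchmark:

    public class FittedRuntimeModel {

        private final double interceptMillis; // fixed overhead term
        private final double millisPerByte;   // marginal cost per input byte

        private FittedRuntimeModel(double interceptMillis, double millisPerByte) {
            this.interceptMillis = interceptMillis;
            this.millisPerByte = millisPerByte;
        }

        /**
         * Ordinary least-squares fit of runtime = a + b * inputBytes over
         * the saved (inputBytes[i], runtimeMillis[i]) benchmark pairs.
         */
        static FittedRuntimeModel fit(double[] inputBytes, double[] runtimeMillis) {
            int n = inputBytes.length;
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                sx += inputBytes[i];
                sy += runtimeMillis[i];
                sxx += inputBytes[i] * inputBytes[i];
                sxy += inputBytes[i] * runtimeMillis[i];
            }
            double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
            double a = (sy - b * sx) / n;
            return new FittedRuntimeModel(a, b);
        }

        /** Predict a new job's runtime with no cluster access at all. */
        double predictMillis(double targetInputBytes) {
            return interceptMillis + millisPerByte * targetInputBytes;
        }

        public static void main(String[] args) {
            // Illustrative saved results from control runs at several sizes.
            double[] bytes  = { 1e9, 5e9, 10e9, 50e9 };
            double[] millis = { 45000, 150000, 280000, 1300000 };
            FittedRuntimeModel model = fit(bytes, millis);
            System.out.printf("Predicted runtime for 200 GB: %.1f minutes%n",
                    model.predictMillis(200e9) / 60000.0);
        }
    }

Once fit, a prediction is just a formula evaluation - which is the whole
appeal of approach (2): no cluster time is needed at estimation time.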