Subject: Re: Time taken to do a word count on 10 TB data.
From: Shashidhar Rao <raoshashidhar123@gmail.com>
To: user@hadoop.apache.org
Date: Tue, 15 Apr 2014 09:21:20 +0530

Thanks, Stanley Shi.

On Tue, Apr 15, 2014 at 6:25 AM, Stanley Shi <sshi@gopivotal.com> wrote:

> Rough estimation: since word count requires very little computation, it
> is I/O-centric, so we can base the estimate on disk speed.
>
> Assume 10 disks per node at 100 MBps each, i.e. about 1 GBps per node;
> assume 70% utilization in the mappers, which gives 700 MBps per node.
> For 30 nodes that is about 20 GBps in total, so we need about 500
> seconds for 10 TB of data. Adding some MapReduce overhead and the final
> merge, say 20%, we can expect about 10 minutes here.
>
> On Tuesday, April 15, 2014, Shashidhar Rao <raoshashidhar123@gmail.com>
> wrote:
>
>> Hi,
>>
>> Can somebody provide me a rough estimate of the time taken, in hours
>> or minutes, for a cluster of, say, 30 nodes to run a MapReduce job
>> that performs a word count on, say, 10 TB of data, assuming the
>> hardware and the MapReduce program are tuned optimally.
>>
>> Just a rough estimate; the data could be 5 TB, 10 TB, or 20 TB. If not
>> word count, the job could simply analyze data of the above size.
>>
>> Regards,
>> Shashidhar
>
> --
> Regards,
> Stanley Shi
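
For anyone who wants to plug in their own numbers, here is a minimal
Python sketch of the same back-of-envelope arithmetic. The disk count,
per-disk speed, mapper utilization, and overhead figures are the
assumptions from Stanley's estimate, not measured values.

    # Back-of-envelope estimate of word-count runtime on a Hadoop cluster,
    # following the I/O-bound reasoning above. Every parameter default is
    # an assumption carried over from the estimate, not a measurement.

    def estimate_wordcount_seconds(
        data_tb: float = 10.0,      # input size in TB (decimal units)
        nodes: int = 30,            # cluster size
        disks_per_node: int = 10,   # assumed spindles per node
        disk_mbps: float = 100.0,   # assumed sequential read speed per disk (MB/s)
        utilization: float = 0.70,  # assumed fraction of raw bandwidth mappers reach
        overhead: float = 0.20,     # assumed scheduling + final-merge overhead
    ) -> float:
        per_node_mbps = disks_per_node * disk_mbps * utilization  # 700 MB/s
        cluster_mbps = per_node_mbps * nodes                      # ~21,000 MB/s (~20 GBps)
        data_mb = data_tb * 1_000_000                             # 10 TB = 10^7 MB
        return (data_mb / cluster_mbps) * (1.0 + overhead)

    if __name__ == "__main__":
        secs = estimate_wordcount_seconds()
        print(f"~{secs:.0f} s (~{secs / 60:.1f} min)")  # ~571 s, roughly 10 minutes

With the same cluster, 20 TB of input doubles the figure to roughly 19
minutes, which is the kind of scaling the original question was after.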