Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
MIME-Version: 1.0
References: <1393256571.6895.2.camel@bentzn-laptop-2013> <1393266174.6895.5.camel@bentzn-laptop-2013>
Date: Tue, 25 Feb 2014 15:43:29 -0500
Subject: Re: Performance
From: Devin Suiter RDX
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=bcaec53d5a8b04cf4904f341245b

--bcaec53d5a8b04cf4904f341245b
Content-Type: text/plain; charset=ISO-8859-1

http://sortbenchmark.org/

That site doesn't cover only Hadoop, but maybe the methodology will give you an idea of what you're looking for.

There are too many variables to pin down a "general" average. Every job will run differently on every cluster: the machines can be heterogeneous builds with heterogeneous configs at the machine level, the cluster will have configs that may or may not override the machine configs, and the job submitter can specify runtime variables on top of that.

Things like the type of data being processed affect the amount of disk I/O, network traffic required, etc., which are in turn affected by their components...

Throwing more nodes at a problem will usually make it faster, but how much faster depends on the factors above.

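One way to put a number on "how much faster" for a particular cluster is to time the same job several times before and after a change and compare the medians. The sketch below is only an illustration, not something from this thread: the function names (time_command, summarize, compare) are made up for the example, and the command it times is a stand-in Python one-liner where a real test would use a Hadoop job that models your own workload.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # raises CalledProcessError if the job fails
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Reduce a list of timings to the median, fastest, and slowest run."""
    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


def compare(baseline, candidate):
    """Ratio of median times; above 1.0 means the candidate runs were faster."""
    return summarize(baseline)["median"] / summarize(candidate)["median"]


if __name__ == "__main__":
    # Stand-in workload (an assumption) so the sketch runs anywhere.
    # Replace it with the command that launches your own benchmark job.
    cmd = [sys.executable, "-c", "sum(range(100000))"]
    print(summarize(time_command(cmd, runs=3)))
```

Wall-clock time is only one of the signals worth watching; spill and network traffic would have to be read from the job's own counters, which this sketch does not attempt.
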
The best way to read your cluster is to establish a benchmark operation that models your expected use case (or one of them), then adjust things on the cluster and see what tips the time, spill, network traffic, etc. one way or the other.

Eric Sammer's *Hadoop Operations* breaks down nicely how real-life cluster configs affect performance. There are also a lot of case studies in Tom White's *Hadoop: The Definitive Guide*.

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Tue, Feb 25, 2014 at 3:09 PM, Brian Stempin wrote:

> Part of the problem is the word, "process." That could be really
> complicated or really easy. It could also be done in Java or some other
> language via the streaming JAR.
>
> It's hard for anyone to say without more details. Even with more details,
> it's still pretty hard to say.
>
> Brian
>
>
> On Mon, Feb 24, 2014 at 1:22 PM, Thomas Bentsen wrote:
>
>> Thanks Dieter!
>> I'll look into it.
>>
>> Still... It would be nice to hear something from the real world. Would
>> any of you working with Hadoop in a prod env be willing to share
>> something?
>>
>> /th
>>
>>
>> On Mon, 2014-02-24 at 16:56 +0100, Dieter De Witte wrote:
>> > Hi,
>> >
>> > The terasort benchmark is probably the most common. It has mappers and
>> > reducers doing 'nothing', this way you only use the framework's
>> > mergesort functionalities.
>> >
>> >
>> > Regards, Dieter
>> >
>> >
>> > 2014-02-24 16:42 GMT+01:00 Thomas Bentsen :
>> >         Hi everyone
>> >
>> >         I am still beginning Hadoop.
>> >         Is there any benchmarks or 'performance heuristics' for
>> >         Hadoop?
>> >         Is it possible to say something like 'You can process X lines
>> >         of GZipped log file on a medium AWS server in Y minutes'? I
>> >         would like to get an idea of what kind of workflow is possible.
>> >
>> >         Thanks in advance
>> >
>> >         Thomas Bentsen
>> >
>>
>

--bcaec53d5a8b04cf4904f341245b
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
--bcaec53d5a8b04cf4904f341245b--