Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of dsuiter@rdx.com designates
 74.125.82.173 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAHJCPcU2YRW079yLA0pf9fA9nFNXR0ywkJDffZeyirQhjSjahw@mail.gmail.com>
References: <1393256571.6895.2.camel@bentzn-laptop-2013>
	<CALSJUsRw3kwjUwnuU=_ywiaMj65PxPVqjHgXggPDsKRRwuFHXw@mail.gmail.com>
	<1393266174.6895.5.camel@bentzn-laptop-2013>
	<CAHJCPcU2YRW079yLA0pf9fA9nFNXR0ywkJDffZeyirQhjSjahw@mail.gmail.com>
Date: Tue, 25 Feb 2014 15:43:29 -0500
Message-ID: 
 <CAE_UNJVJgkgWQBeMR9bM6oB=ZbMKrxLSsxkfhuTyLyDYr6mx7g@mail.gmail.com>
Subject: Re: Performance
From: Devin Suiter RDX <dsuiter@rdx.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=bcaec53d5a8b04cf4904f341245b

--bcaec53d5a8b04cf4904f341245b
Content-Type: text/plain; charset=ISO-8859-1

http://sortbenchmark.org/

Doesn't just cover Hadoop, but maybe the methodology will give you an idea
of what you're looking for.

There are too many variables to pin down a "general" average. Every job will
run differently on every cluster: the machines can be heterogeneous builds
with heterogeneous machine-level configs, the cluster has configs that may
or may not override the machine configs, and the job submitter can specify
runtime variables on top of that.

Things like the type of data being processed affect the amount of disk I/O,
network traffic required, etc., which are in turn affected by their
components...

Throwing more nodes at a problem will usually make it faster, but how much
faster depends...
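
A rough illustration of why "how much faster" depends: the sketch below uses
Amdahl's law, which is a simplifying assumption brought in here and not a
measurement from any cluster. It supposes some fixed fraction of a job cannot
be spread across nodes (the 0.9 parallel fraction is a made-up number) and
ignores shuffle and network overhead entirely.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Ideal speedup when only `parallel_fraction` of the work scales with nodes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


# Hypothetical job where 90% of the work parallelizes perfectly.
for nodes in (1, 2, 4, 8, 16):
    print(f"{nodes:>2} nodes -> {amdahl_speedup(0.9, nodes):.2f}x")
```

Under these assumptions, each doubling of nodes buys less than the one
before it, and a real job will usually do worse than this ideal.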

The best way to read your cluster is to establish a benchmark operation that
models your expected use case (or one of them), then adjust things on the
cluster and see what tips the time, spill, network traffic, etc. one way or
another.
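
As a minimal sketch of that loop, assuming the benchmark can be launched as a
command line: the helper below runs each variant a few times and reports the
median wall-clock time. The function names and the config labels are
hypothetical, and it only captures elapsed time; spill and network traffic
would have to come from the job's own counters, which this does not read.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=3):
    """Run `cmd` (a list of arguments) `runs` times; return seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # raises CalledProcessError if the job fails
        timings.append(time.perf_counter() - start)
    return timings


def compare(configs, runs=3):
    """Map each config label to the median wall-clock time of its command."""
    return {label: statistics.median(time_command(cmd, runs))
            for label, cmd in configs.items()}


# Hypothetical usage; the jar, class name, and paths are placeholders:
# results = compare({
#     "baseline": ["hadoop", "jar", "my-benchmark.jar", "MyJob", "/bench/in", "/bench/out-a"],
#     "tuned":    ["hadoop", "jar", "my-benchmark.jar", "MyJob", "/bench/in", "/bench/out-b"],
# })
# for label, seconds in sorted(results.items(), key=lambda kv: kv[1]):
#     print(f"{label}: {seconds:.1f}s")
```

Using the median of several runs rather than a single run keeps one noisy
run from deciding the comparison; change one setting at a time between
variants so a difference can be attributed to it.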

Eric Sammer's *Hadoop Operations* will break down nicely how real-life
cluster configs affect performance. There are also a lot of case studies in
Tom White's *Hadoop: The Definitive Guide*.

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Tue, Feb 25, 2014 at 3:09 PM, Brian Stempin <bstempin@rightaction.com> wrote:

> Part of the problem is the word "process."  That could be really
> complicated or really easy.  It could also be done in Java or some other
> language via the streaming JAR.
>
> It's hard for anyone to say without more details.  Even with more details,
> it's still pretty hard to say.
>
> Brian
>
>
> On Mon, Feb 24, 2014 at 1:22 PM, Thomas Bentsen <th@bentzn.com> wrote:
>
>> Thanks Dieter!
>> I'll look into it.
>>
>> Still... It would be nice to hear something from the real world. Would
>> any of you working with Hadoop in a prod env be willing to share
>> something?
>>
>> /th
>>
>>
>>
>>
>> On Mon, 2014-02-24 at 16:56 +0100, Dieter De Witte wrote:
>> > Hi,
>> >
>> > The terasort benchmark is probably the most common. Its mappers and
>> > reducers do 'nothing', so you only exercise the framework's
>> > merge-sort functionality.
>> >
>> >
>> > Regards, Dieter
>> >
>> >
>> >
>> > 2014-02-24 16:42 GMT+01:00 Thomas Bentsen <th@bentzn.com>:
>> >         Hi everyone
>> >
>> >         I am still a Hadoop beginner.
>> >         Are there any benchmarks or 'performance heuristics' for
>> >         Hadoop?
>> >         Is it possible to say something like 'You can process X lines
>> >         of GZipped log file on a medium AWS server in Y minutes'? I
>> >         would like to get an idea of what kind of workflow is possible.
>> >
>> >         Thanks in advance
>> >
>> >         Thomas Bentsen
>> >
>> >
>> >
>>
>>
>>
>

--bcaec53d5a8b04cf4904f341245b--