hadoop-mapreduce-user mailing list archives

From Thomas Bentsen ...@bentzn.com>
Subject Re: Performance
Date Tue, 25 Feb 2014 21:28:20 GMT
Thanks a lot guys!
From Dieter's original reply I got TeraSort, and I am currently running
different scenarios with that. It seems to be The Benchmark right now.
It's relatively simple and yet it does test most of the functionality.
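
A rough sketch of how such scenarios could be scripted, in case it is useful:
it builds one TeraSort command per combination of settings and times each run.
The jar name, the HDFS paths and the chosen values are assumptions for
illustration, and the property names are the Hadoop 2 ones, so adjust them to
your installation. The input is assumed to have been generated with teragen
beforehand.

```python
import itertools
import subprocess
import time

# Assumption: name/location of the examples jar; it differs per distribution.
EXAMPLES_JAR = "hadoop-mapreduce-examples.jar"


def terasort_command(in_dir, out_dir, settings, jar=EXAMPLES_JAR):
    """Build a 'hadoop jar ... terasort' command with -D property overrides."""
    overrides = ["-D{}={}".format(k, v) for k, v in sorted(settings.items())]
    return ["hadoop", "jar", jar, "terasort"] + overrides + [in_dir, out_dir]


def scenarios(grid):
    """Yield one settings dict per combination of the values in grid."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def time_run(cmd, runner=subprocess.check_call):
    """Run cmd through runner and return the elapsed wall-clock seconds."""
    start = time.monotonic()
    runner(cmd)
    return time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical grid: vary the reducer count and the map-side sort buffer.
    grid = {
        "mapreduce.job.reduces": [2, 4],
        "mapreduce.task.io.sort.mb": [100, 200],
    }
    for i, settings in enumerate(scenarios(grid)):
        # Each run needs a fresh output directory (paths are made up).
        out_dir = "/bench/tera-out-{}".format(i)
        cmd = terasort_command("/bench/tera-in", out_dir, settings)
        print(settings, round(time_run(cmd), 1), "s")
```

Wall-clock time is only one signal; the job counters (spilled records, shuffle
bytes) are probably worth recording next to it for each scenario.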

Devin: You mention a couple of books I already have in the stack for
reading. Do any of you know of an authoritative source on actual
optimization (maybe even 'profiling'?) of a Hadoop cluster?
I am testing on relatively (very) light HW and my background is Java
servers so I started fiddling with mem-settings - of course. Not much
luck there. :-D 


/th





On Tue, 2014-02-25 at 15:43 -0500, Devin Suiter RDX wrote:
> http://sortbenchmark.org/
> 
> 
> Doesn't just cover Hadoop, but maybe the methodology will give you an
> idea of what you're looking for.
> 
> 
> There are too many variables to pin down a "general" average. Every job
> will run differently on every cluster, given the machines can be
> heterogeneous builds, with heterogeneous configs at the machine level,
> then the cluster will have configs that may or may not override the
> machine configs...plus the job submitter can specify runtime
> variables...
> 
> 
> Things like the type of data being processed affect the amount of disk
> I/O, network traffic required, etc., which are in turn affected by
> their components...
> 
> 
> Throwing more nodes at a problem will usually make it faster, but how
> much faster depends...
> 
> 
> The best way to read your cluster is to establish a benchmark operation
> that models your expected use case (or one of them), then adjust things
> on the cluster and see what tips the time, spill, network traffic, etc.
> one way or another.
> 
> 
> Eric Sammer's Hadoop Operations will break down nicely how real-life
> cluster configs affect performance. There are also a lot of case
> studies in Tom White's Hadoop: The Definitive Guide.
> 
> Devin Suiter
> Jr. Data Solutions Software Engineer
> 
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
> 
> 
> On Tue, Feb 25, 2014 at 3:09 PM, Brian Stempin
> <bstempin@rightaction.com> wrote:
>         Part of the problem is the word, "process."  That could be
>         really complicated or really easy.  It could also be done in
>         Java or some other language via the streaming JAR.
>         
>         
>         It's hard for anyone to say without more details.  Even with
>         more details, it's still pretty hard to say.
>         
>         
>         Brian
>         
>         
>         On Mon, Feb 24, 2014 at 1:22 PM, Thomas Bentsen
>         <th@bentzn.com> wrote:
>                 Thanks Dieter!
>                 I'll look into it.
>                 
>                 Still... It would be nice to hear something from the
>                 real world. Would any of you working with Hadoop in a
>                 prod env be willing to share something?
>                 
>                 /th
>                 
>                 
>                 
>                 
>                 On Mon, 2014-02-24 at 16:56 +0100, Dieter De Witte
>                 wrote:
>                 > Hi,
>                 >
>                 > The terasort benchmark is probably the most common.
>                 > It has mappers and reducers doing 'nothing', so you
>                 > only use the framework's mergesort functionality.
>                 >
>                 >
>                 > Regards, Dieter
>                 >
>                 >
>                 >
>                 > 2014-02-24 16:42 GMT+01:00 Thomas Bentsen
>                 <th@bentzn.com>:
>                 >         Hi everyone
>                 >
>                 >         I am still new to Hadoop.
>                 >         Are there any benchmarks or 'performance
>                 >         heuristics' for Hadoop?
>                 >         Is it possible to say something like 'You
>                 >         can process X lines of GZipped log file on a
>                 >         medium AWS server in Y minutes'? I would like
>                 >         to get an idea of what kind of workflow is
>                 >         possible.
>                 >
>                 >         Thanks in advance
>                 >
>                 >         Thomas Bentsen
>                 >
>                 >
>                 >
>                 
>                 
>                 
>         
>         
> 
> 


