hadoop-hdfs-user mailing list archives

From Devin Suiter RDX <dsui...@rdx.com>
Subject Re: Performance
Date Tue, 25 Feb 2014 20:43:29 GMT
http://sortbenchmark.org/

Doesn't just cover Hadoop, but maybe the methodology will give you an idea
of what you're looking for.

There are too many variables to pin down a "general" average. Every job will
run differently on every cluster: the machines can be heterogeneous
builds with heterogeneous configs at the machine level, the cluster
will have configs that may or may not override the machine configs, and
the job submitter can specify runtime variables...

Things like the type of data being processed affect the amount of disk I/O,
network traffic required, etc., which are in turn affected by their
components...

Throwing more nodes at a problem will usually make it faster, but how much
faster depends...

The best way to read your cluster is to establish a benchmark operation that
models your expected use case (or one of them), then adjust things on the
cluster and see what tips the time, spill, network traffic, etc. one way or
another.
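
As a rough illustration of that loop, here is a minimal Python sketch that
times the same command several times and compares the median against a
baseline. It only measures wall-clock time (not spill or network traffic),
and the helper names and the stand-in command are hypothetical; you would
substitute your own job submission command.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so a broken run is not timed.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce a list of timings to a few numbers that are easy to compare."""
    return {
        "runs": len(timings),
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


def compare(baseline, candidate):
    """Relative change in median time; a negative value means faster."""
    return (candidate["median_s"] - baseline["median_s"]) / baseline["median_s"]


if __name__ == "__main__":
    # Stand-in command (assumption): replace with your real benchmark job.
    job = [sys.executable, "-c", "pass"]

    baseline = summarize(time_command(job))
    # ...change one cluster or job setting here, then measure again...
    candidate = summarize(time_command(job))

    print("baseline :", baseline)
    print("candidate:", candidate)
    print("relative change in median: {:+.1%}".format(compare(baseline, candidate)))
```

Changing one setting per run and repeating each measurement a few times
makes it easier to tell a real effect from run-to-run noise.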

Eric Sammer's *Hadoop Operations* will break down nicely how real-life
cluster configs affect performance. There are also a lot of case studies in
Tom White's *Hadoop: The Definitive Guide*.

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Tue, Feb 25, 2014 at 3:09 PM, Brian Stempin <bstempin@rightaction.com> wrote:

> Part of the problem is the word, "process."  That could be really
> complicated or really easy.  It could also be done in Java or some other
> language via the streaming JAR.
>
> It's hard for anyone to say without more details.  Even with more details,
> it's still pretty hard to say.
>
> Brian
>
>
> On Mon, Feb 24, 2014 at 1:22 PM, Thomas Bentsen <th@bentzn.com> wrote:
>
>> Thanks Dieter!
>> I'll look into it.
>>
>> Still... It would be nice to hear something from the real world. Would
>> any of you working with Hadoop in a prod env be willing to share
>> something?
>>
>> /th
>>
>>
>>
>>
>> On Mon, 2014-02-24 at 16:56 +0100, Dieter De Witte wrote:
>> > Hi,
>> >
>> > The terasort benchmark is probably the most common. It has mappers and
>> > reducers doing 'nothing', this way you only use the framework's
>> > mergesort functionalities.
>> >
>> >
>> > Regards, Dieter
>> >
>> >
>> >
>> > 2014-02-24 16:42 GMT+01:00 Thomas Bentsen <th@bentzn.com>:
>> >         Hi everyone
>> >
>> >         I am still beginning Hadoop.
>> >         Are there any benchmarks or 'performance heuristics' for
>> >         Hadoop?
>> >         Is it possible to say something like 'You can process X lines
>> >         of GZipped log file on a medium AWS server in Y minutes'? I
>> >         would like to get an idea of what kind of workflow is possible.
>> >
>> >         Thanks in advance
>> >
>> >         Thomas Bentsen
>> >
>> >
>> >
>>
>>
>>
>
