hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Optimal setup for a test problem
Date Tue, 13 Apr 2010 18:46:20 GMT
On Tue, Apr 13, 2010 at 11:40 AM, Andrew Nguyen <andrew-lists-hadoop@ucsfcti.org> wrote:

> Good to know...  The problem is that I'm in an academic environment that
> needs a lot of convincing regarding new computational technologies.  I need
> to show proven benefit before getting the funds to actually implement
> anything.  These servers were the best I could come up with for this
> proof-of-concept.
>
> I changed some settings on the nodes and have been experimenting, and I'm
> seeing about 3.4 MB/sec with TestDFSIO, which is pretty consistent with your
> observations below.
>
> Given that, would increasing the block size help my performance?  This
> should result in fewer map tasks and keep the computation local for
> longer...?  I just need to show that the numbers are better than a single
> machine, even if that means sacrificing redundancy (or other factors) in
> the current setup.
>
>
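On the block-size question: HDFS block size is fixed per file when the file is
written, so the input data would need to be re-written with larger blocks before
the job actually sees fewer, more local map tasks. A minimal sketch of raising it
for the files a job writes, assuming a 0.20-style driver (the class name and the
rest of the setup are placeholders, not from this thread):

    // Sketch: raise the HDFS block size for files this job writes.
    // dfs.block.size is the 0.20-era property (64 MB default); everything
    // else here is illustrative boilerplate.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapred.JobConf;

    public class LargeBlockJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB blocks
        JobConf job = new JobConf(conf, LargeBlockJob.class);
        // ... set mapper/reducer, input and output paths, then submit ...
      }
    }
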
If that's your goal, set dfs.replication to 1 in your job - this will make
the output unreplicated, which means it won't go over the network. Of
course, you'll also lose data if a node goes down, but if your goal is to
cheat, it's an effective way of doing so!
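
If the driver builds a JobConf, a minimal sketch of that (the driver class name
is a placeholder, not from this thread):

    // Sketch: write job output with a single replica so it stays on the local
    // DataNode instead of being copied over the network to other nodes.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapred.JobConf;

    public class SingleReplicaJob {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(new Configuration(), SingleReplicaJob.class);
        job.setInt("dfs.replication", 1); // no replica traffic for the output
        // ... set mapper/reducer, input and output paths, then submit ...
      }
    }

If the driver goes through ToolRunner, passing -D dfs.replication=1 on the
command line does the same thing.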

You'll also get some benefit by using LZO compression to reduce the amount
of network transfer.
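
For the intermediate (map) output, that's a couple of job properties. A sketch,
assuming the hadoop-lzo codec is installed on every node and using the 0.20-era
property names (the class here is just a placeholder):

    // Sketch: compress map output with LZO so the shuffle moves less data
    // over the network. Assumes com.hadoop.compression.lzo.LzoCodec from the
    // hadoop-lzo package is installed and on every node's classpath.
    import org.apache.hadoop.mapred.JobConf;

    public class LzoMapOutputJob {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(LzoMapOutputJob.class);
        job.setBoolean("mapred.compress.map.output", true);
        job.set("mapred.map.output.compression.codec",
                "com.hadoop.compression.lzo.LzoCodec");
        // ... rest of the job setup and submission ...
      }
    }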

-Todd


-- 
Todd Lipcon
Software Engineer, Cloudera
