hadoop-common-user mailing list archives

From "Tom White" <tom.e.wh...@gmail.com>
Subject Re: Hadoop & EC2
Date Thu, 04 Sep 2008 13:21:01 GMT
On Thu, Sep 4, 2008 at 1:46 PM, Ryan LeCompte <lecompte@gmail.com> wrote:
> I'm noticing that using bin/hadoop fs -put ... s3://... is uploading
> multi-gigabyte files in ~64MB chunks.

That's because the S3FileSystem stores files as 64MB blocks on S3.
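
If you want bigger chunks on S3, the block size the S3 filesystem
uses is configurable. If I remember right the property is
fs.s3.block.size (check hadoop-default.xml for the exact name and
default), so something like this in your hadoop-site.xml should do it
(the value is in bytes):

  <property>
    <name>fs.s3.block.size</name>
    <value>134217728</value> <!-- 128MB -->
  </property>

Only files written after the change pick up the new block size.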

> Then, this is copied from S3 into HDFS using bin/hadoop distcp. Once
> the files are there and the job begins, it looks like it's breaking
> up the 4 multi-gigabyte text files into about 225 maps. Does this
> mean that each map is processing roughly 64MB of data?

Yes, HDFS stores files as 64MB blocks too, and map input is split by
default so each map processes one block.
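
As a rough sanity check, 225 maps at 64MB per block comes to about
14GB, which is what you'd expect if your 4 files are around 3.5GB
each:

  225 maps x 64MB ~= 14,400MB ~= 14GB
  14GB / 4 files ~= 3.5GB per file

So block-sized splits account for the map count you're seeing.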

> If so, is there any way to change this
> so that I can get my map tasks to process more data at a time? I'm
> curious if this will shorten the time it takes to run the program.

You could try increasing the HDFS block size. 128MB is usually a
better value, for this very reason.
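
For example, something like this in hadoop-site.xml (the property is
dfs.block.size if I remember right, and the value is in bytes):

  <property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- 128MB -->
  </property>

This only applies to newly written files, so you'd need to re-copy
your data for it to take effect. Alternatively, raising
mapred.min.split.size (if that's still the name) above the block size
should also give you bigger splits, at some cost in data locality
since a split may then span more than one block.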

In the future https://issues.apache.org/jira/browse/HADOOP-2560 will
help here too.

> Tom, in your article about Hadoop + EC2 you mention processing about
> 100GB of logs in under 6 minutes or so.

In that article it took 35 minutes to run the job. I'm planning on
doing some benchmarking on EC2 fairly soon, which should help us
improve the performance of Hadoop on EC2. It's worth remarking that
this was running on small instances; the larger instances perform a
lot better in my experience.

> Do you remember how many EC2 instances you had running, and how many
> map tasks you had operating on the 100GB? Was each map task handling
> about 1GB?

I was running 20 nodes, and each map task was handling an HDFS block (64MB).
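
At a 64MB block size that works out to roughly 100GB / 64MB, or about
1,600 map tasks spread across the 20 nodes, so each map was handling
64MB rather than 1GB.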

Hope this helps,

Tom