hadoop-common-user mailing list archives

From "Ryan LeCompte" <lecom...@gmail.com>
Subject Re: Hadoop & EC2
Date Thu, 04 Sep 2008 13:26:20 GMT
Hi Tom,

This clears up my questions.

Thanks!

Ryan



On Thu, Sep 4, 2008 at 9:21 AM, Tom White <tom.e.white@gmail.com> wrote:
> On Thu, Sep 4, 2008 at 1:46 PM, Ryan LeCompte <lecompte@gmail.com> wrote:
>> I'm noticing that using bin/hadoop fs -put ... s3://... is uploading
>> multi-gigabyte files in ~64MB chunks.
>
> That's because the S3FileSystem stores files as 64MB blocks on S3.
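> If you want larger chunks on the S3 side, something like this in
> hadoop-site.xml should do it (a sketch; fs.s3.block.size is assumed
> here, the value is in bytes for 128MB, and it only affects files
> written after the change):
>
>   <property>
>     <name>fs.s3.block.size</name>
>     <value>134217728</value>
>   </property>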
>
>> Then this is copied from
>> S3 into HDFS using bin/hadoop distcp. Once the files are there and the
>> job begins, it looks like it's breaking up the 4 multi-gigabyte text
>> files into about 225 maps. Does this mean that each map is processing
>> roughly 64MB of data?
>
> Yes, HDFS stores files as 64MB blocks too, and map input is split by
> default so each map processes one block.
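> With 225 maps at 64MB each, that's roughly 14GB of input, so presumably
> around 3.5GB per file.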
>
>> If so, is there any way to change this
>> so that I can get my map tasks to process more data at a time? I'm
>> curious if this will shorten the time it takes to run the program.
>
> You could try increasing the HDFS block size. 128MB is usually a
> better value, for this very reason.
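> For example, a minimal sketch in hadoop-site.xml (value in bytes for
> 128MB; the property name may differ between versions):
>
>   <property>
>     <name>dfs.block.size</name>
>     <value>134217728</value>
>   </property>
>
> The block size only applies to files written after the change, so you'd
> want the distcp from S3 into HDFS to run with the new setting in place.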
>
> In the future https://issues.apache.org/jira/browse/HADOOP-2560 will
> help here too.
>
>>
>> Tom, in your article about Hadoop + EC2 you mention processing about
>> 100GB of logs in under 6 minutes or so.
>
> In this article:
> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873,
> it took 35 minutes to run the job. I'm planning on doing some
> benchmarking on EC2 fairly soon, which should help us improve the
> performance of Hadoop on EC2. It's worth remarking that this was
> running on small instances. The larger instances perform a lot better
> in my experience.
>
>> Do you remember how many EC2
>> instances you had running, and how many map tasks you used to process
>> the 100GB? Was each map task handling about 1GB?
>
> I was running 20 nodes, and each map task was handling an HDFS block (64MB).
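> That works out to roughly 100GB / 64MB, i.e. about 1,600 map tasks,
> spread across the 20 nodes.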
>
> Hope this helps,
>
> Tom
>
