hadoop-common-user mailing list archives

From "Ryan LeCompte" <lecom...@gmail.com>
Subject Re: Hadoop & EC2
Date Tue, 02 Sep 2008 18:41:47 GMT
How can you ensure that your S3 buckets and EC2 instances are in the
same zone?

Ryan
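
Karl's suggestion below boils down to checking that the region prefix matches on both sides. A minimal sketch of that check (a hypothetical helper, not part of Hadoop, assuming the AWS naming convention where an availability zone name is its region name plus a letter, e.g. "us-east-1a" lies in region "us-east-1"):

```python
def same_region(ec2_availability_zone, s3_bucket_region):
    """Return True if an EC2 availability zone belongs to the S3 bucket's region."""
    # Strip the trailing zone letter(s): "us-east-1a" -> "us-east-1".
    region = ec2_availability_zone.rstrip("abcdefghijklmnopqrstuvwxyz")
    return region == s3_bucket_region

print(same_region("us-east-1a", "us-east-1"))  # True
print(same_region("us-east-1a", "eu-west-1"))  # False
```

The bucket's region itself can be read back from S3 (e.g. via a GetBucketLocation request), and the instance's zone from the EC2 instance metadata.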


On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson <kra@monkey.org> wrote:
>
> On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote:
>
>> Hi Tim,
>>
>> Are you mostly just processing/parsing textual log files? How many
>> maps/reduces did you configure in your hadoop-ec2-env.sh file? How
>> many did you configure in your JobConf? I'm just trying to get an
>> idea of what to expect in terms of performance. I'm noticing that it
>> takes about 16 minutes to transfer about 15GB of uncompressed text
>> data from S3 into HDFS after the cluster has started with 15 nodes. I
>> expected this to take less time, but maybe my assumptions are
>> incorrect. I'm also noticing that it takes about 15 minutes to parse
>> through the 15GB of data with a 15-node cluster.
>
> I'm seeing much faster speeds.  With 128 nodes running a mapper-only
> downloading job, downloading 30 GB takes roughly a minute, less time
> than the end-of-job work (which I assume is HDFS replication and
> bookkeeping).  More mappers give you more parallel downloads, of
> course.  I'm using a Python REST client for S3, and only move data to
> or from S3 when Hadoop is done with it.
>
> Make sure your S3 buckets and EC2 instances are in the same zone.
>
>
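
For a rough sense of the gap between the two reported rates (figures taken from the thread, with Karl's "roughly a minute" taken at face value), the per-node throughput works out to:

```python
def per_node_mb_per_sec(gigabytes, minutes, nodes):
    """Back-of-the-envelope per-node transfer rate in MB/s."""
    return gigabytes * 1024 / (minutes * 60) / nodes

# Ryan: 15 GB in 16 minutes across 15 nodes -> ~1.07 MB/s per node.
print(per_node_mb_per_sec(15, 16, 15))
# Karl: 30 GB in ~1 minute across 128 nodes -> 4.0 MB/s per node.
print(per_node_mb_per_sec(30, 1, 128))
```

So Karl's cluster is moving data roughly four times faster per node, consistent with his points about more parallel downloads and keeping the buckets and instances colocated.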
