hadoop-common-user mailing list archives

From "Ryan LeCompte" <lecom...@gmail.com>
Subject Re: Hadoop & EC2
Date Wed, 03 Sep 2008 14:05:09 GMT

I noticed that you mentioned using Amazon's new elastic block store as
an alternative to using S3. Right now I'm testing pushing data to S3,
then moving it from S3 into HDFS once the Hadoop cluster is up and
running in EC2. It works pretty well -- moving data from S3 to HDFS is
fast when the data in S3 is broken up into multiple files, since
bin/hadoop distcp uses a Map/Reduce job to efficiently transfer the data.

Are there any real advantages to using the new elastic block store? Is
moving data from the elastic block store into HDFS any faster than
doing it from S3? Or can HDFS essentially live inside of the elastic
block store?
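
As a point of comparison for those questions, here is a rough back-of-the-envelope sketch of the throughput implied by the figures quoted further down in this thread (about 15GB copied from S3 into HDFS in about 16 minutes, and about 15 minutes to parse it, both on 15 nodes). The function name is made up for illustration, and the numbers rest on two assumptions: 1 GB = 1024 MB, and the work is spread evenly across the nodes, which may not hold in practice.

```python
def throughput_mb_per_s(gigabytes, minutes, nodes=1):
    """Average throughput in MB/s, divided evenly across `nodes`.

    Assumption: 1 GB = 1024 MB. With nodes=1 this is the aggregate rate.
    """
    megabytes = gigabytes * 1024
    seconds = minutes * 60
    return megabytes / seconds / nodes


# Figures from the thread: ~15GB, 15 nodes, ~16 min to copy, ~15 min to parse.
copy_total = throughput_mb_per_s(15, 16)
copy_per_node = throughput_mb_per_s(15, 16, nodes=15)
parse_total = throughput_mb_per_s(15, 15)
parse_per_node = throughput_mb_per_s(15, 15, nodes=15)

print(f"S3 -> HDFS copy: {copy_total:.1f} MB/s total, {copy_per_node:.2f} MB/s per node")
print(f"Parse job:       {parse_total:.1f} MB/s total, {parse_per_node:.2f} MB/s per node")
```

Under those assumptions both steps work out to roughly 1 MB/s per node, which gives a baseline to compare against if the same copy is tried from an elastic block store volume.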



On Wed, Sep 3, 2008 at 9:54 AM, Tom White <tom.e.white@gmail.com> wrote:
> There's a case study with some numbers in it from a presentation I
> gave on Hadoop and AWS in London last month, which you may find
> interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.
> tim robertson <timrobertson100@gmail.com> wrote:
>> For these small
>> datasets, you might find it useful - let me know if I should spend
>> time finishing it (Or submit help?) - it is really very simple.
> This sounds very useful. Please consider creating a Jira and
> submitting the code (even if it's not "finished" folks might like to
> see it). Thanks.
> Tom
>> Cheers
>> Tim
>> On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte <lecompte@gmail.com> wrote:
>>> Hi Tim,
>>> Are you mostly just processing/parsing textual log files? How many
>>> maps/reduces did you configure in your hadoop-ec2-env.sh file? How
>>> many did you configure in your JobConf? Just trying to get an idea of
>>> what to expect in terms of performance. I'm noticing that it takes
>>> about 16 minutes to transfer about 15GB of textual uncompressed data
>>> from S3 into HDFS after the cluster has started with 15 nodes. I was
>>> expecting this to take a shorter amount of time, but maybe I'm
>>> incorrect in my assumptions. I am also noticing that it takes about 15
>>> minutes to parse through the 15GB of data with a 15 node cluster.
>>> Thanks,
>>> Ryan
>>> On Tue, Sep 2, 2008 at 3:29 AM, tim robertson <timrobertson100@gmail.com> wrote:
>>>> I have been processing only 100s of GBs on EC2, not 1000s, using 20
>>>> nodes, and I'm really only in the exploration and testing phase right now.
>>>> On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock <adpowers@gmail.com> wrote:
>>>>> Hi Ryan,
>>>>> Just a heads up, if you require more than the 20 node limit, Amazon
>>>>> provides a form to request a higher limit:
>>>>> http://www.amazon.com/gp/html-forms-controller/ec2-request
>>>>> Andrew
>>>>> On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte <lecompte@gmail.com> wrote:
>>>>>> Hello all,
>>>>>> I'm curious to see how many people are using EC2 to execute their
>>>>>> Hadoop cluster and map/reduce programs, and how many are using
>>>>>> home-grown datacenters. It seems like the 20 node limit with EC2 is a
>>>>>> bit crippling when one wants to process many gigabytes of data. Has
>>>>>> anyone found this to be the case? How much data are people processing
>>>>>> with their 20 node limit on EC2? Curious what the thoughts are...
>>>>>> Thanks,
>>>>>> Ryan