hadoop-common-user mailing list archives

From "Tom White" <tom.e.wh...@gmail.com>
Subject Re: Hadoop & EC2
Date Wed, 03 Sep 2008 15:21:48 GMT
On Wed, Sep 3, 2008 at 3:05 PM, Ryan LeCompte <lecompte@gmail.com> wrote:
> Tom,
>
> I noticed that you mentioned using Amazon's new elastic block store as
> an alternative to using S3. Right now I'm testing pushing data to S3,
> then moving it from S3 into HDFS once the Hadoop cluster is up and
> running in EC2. It works pretty well -- moving data from S3 to HDFS is
> fast when the data in S3 is broken up into multiple files, since
> bin/hadoop distcp uses a Map/Reduce job to efficiently transfer the
> data.

Yes, this is a good-enough solution for many applications.

>
> Are there any real advantages to using the new elastic block store? Is
> moving data from the elastic block store into HDFS any faster than
> doing it from S3? Or can HDFS essentially live inside of the elastic
> block store?

Bandwidth between EBS and EC2 is better than between S3 and EC2, so if
you intend to run MapReduce on your data then you might consider
running an elastic Hadoop cluster that stores data on EBS-backed HDFS.
The nice thing is that you can shut down the cluster when you're not
using it and then restart it later. But if you have other applications
that need to access data from S3, then this may not be appropriate.
Also, it may not be as fast as HDFS using local disks for storage.

This is a new area, and I haven't done any measurements, so a lot of
this is conjecture on my part. Hadoop on EBS doesn't exist yet - but
it looks like a natural fit.

>
> Thanks!
>
> Ryan
>
>
> On Wed, Sep 3, 2008 at 9:54 AM, Tom White <tom.e.white@gmail.com> wrote:
>> There's a case study with some numbers in it from a presentation I
>> gave on Hadoop and AWS in London last month, which you may find
>> interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.
>>
>> tim robertson <timrobertson100@gmail.com> wrote:
>>> For these small
>>> datasets, you might find it useful - let me know if I should spend
>>> time finishing it (Or submit help?) - it is really very simple.
>>
>> This sounds very useful. Please consider creating a Jira and
>> submitting the code (even if it's not "finished" folks might like to
>> see it). Thanks.
>>
>> Tom
>>
>>>
>>> Cheers
>>>
>>> Tim
>>>
>>>
>>>
>>> On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte <lecompte@gmail.com> wrote:
>>>> Hi Tim,
>>>>
>>>> Are you mostly just processing/parsing textual log files? How many
>>>> maps/reduces did you configure in your hadoop-ec2-env.sh file? How
>>>> many did you configure in your JobConf? Just trying to get an idea of
>>>> what to expect in terms of performance. I'm noticing that it takes
>>>> about 16 minutes to transfer about 15GB of textual uncompressed data
>>>> from S3 into HDFS after the cluster has started with 15 nodes. I was
>>>> expecting this to take a shorter amount of time, but maybe I'm
>>>> incorrect in my assumptions. I am also noticing that it takes about 15
>>>> minutes to parse through the 15GB of data with a 15 node cluster.
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>> On Tue, Sep 2, 2008 at 3:29 AM, tim robertson <timrobertson100@gmail.com> wrote:
>>>>> I have been processing only 100s of GBs on EC2, not 1000s, using 20
>>>>> nodes, and I'm really only in the exploration and testing phase right now.
>>>>>
>>>>>
>>>>> On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock <adpowers@gmail.com> wrote:
>>>>>> Hi Ryan,
>>>>>>
>>>>>> Just a heads up, if you require more than the 20 node limit, Amazon
>>>>>> provides a form to request a higher limit:
>>>>>>
>>>>>> http://www.amazon.com/gp/html-forms-controller/ec2-request
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte <lecompte@gmail.com> wrote:
>>>>>>> Hello all,
>>>>>>>
>>>>>>> I'm curious to see how many people are using EC2 to execute their
>>>>>>> Hadoop cluster and map/reduce programs, and how many are using
>>>>>>> home-grown datacenters. It seems like the 20 node limit with EC2 is a
>>>>>>> bit crippling when one wants to process many gigabytes of data. Has
>>>>>>> anyone found this to be the case? How much data are people processing
>>>>>>> with their 20 node limit on EC2? Curious what the thoughts are...
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
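
The transfer figures quoted in the thread above (about 15GB moved from S3
into HDFS in about 16 minutes on a 15 node cluster) can be turned into
rough throughput numbers. This is only a back-of-the-envelope sketch: the
function name is made up for illustration, 1GB is assumed to be 1024MB, and
the load is assumed to be spread evenly across the nodes, which the thread
does not confirm.

```python
def transfer_throughput(gigabytes, minutes, nodes):
    """Return (aggregate, per_node) throughput in MB/s.

    Hypothetical helper for illustration; assumes 1 GB = 1024 MB and an
    even spread of work across nodes.
    """
    megabytes = gigabytes * 1024
    seconds = minutes * 60
    aggregate = megabytes / seconds
    return aggregate, aggregate / nodes


# Approximate figures from the thread: 15GB, 16 minutes, 15 nodes.
aggregate, per_node = transfer_throughput(15, 16, 15)
print(f"aggregate: {aggregate:.1f} MB/s, per node: {per_node:.2f} MB/s")
```

Under those assumptions the copy averages roughly 16 MB/s across the
cluster, or about 1 MB/s per node. The same helper can be reused for the
parsing step (15GB in about 15 minutes) to compare the two phases.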
