hadoop-common-user mailing list archives

From "tim robertson" <timrobertson...@gmail.com>
Subject Re: Hadoop & EC2
Date Tue, 02 Sep 2008 12:47:12 GMT
Hi Ryan,

I actually blogged my experience as it was my first usage of EC2:

My input data was not log files but actually a dump of 150 million
records from MySQL into about 13 columns of tab-delimited data, I believe.
It was a couple of months ago, but I remember thinking S3 was very slow...

I ran some simple operations like distinct values of one column based
on another (species within a cell) and also did some polygon analysis,
since "is this point in this polygon" does not really scale too
well in PostGIS.
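
As an illustration only (this is not the code used for the analysis above),
a minimal sketch of the "is this point in this polygon" test using the
standard even-odd ray-casting rule. The class name, method name and the
coordinate-array layout are assumptions made for the example; a map task
could run a check like this once per record.

```java
public class PointInPolygon {

    /**
     * Even-odd ray casting: count how many polygon edges a horizontal ray
     * from (px, py) towards +x crosses. An odd count means "inside".
     * xs/ys hold the vertices in order; the polygon is implicitly closed.
     * Points lying exactly on an edge are not handled specially.
     */
    static boolean contains(double[] xs, double[] ys, double px, double py) {
        boolean inside = false;
        int n = xs.length;
        for (int i = 0, j = n - 1; i < n; j = i++) {
            // Does the edge (j -> i) straddle the horizontal line y = py?
            boolean straddles = (ys[i] > py) != (ys[j] > py);
            if (straddles) {
                // x coordinate where the edge meets y = py
                double xAtPy = xs[j] + (py - ys[j]) * (xs[i] - xs[j]) / (ys[i] - ys[j]);
                if (px < xAtPy) {
                    inside = !inside;
                }
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        // A 10 x 10 square used purely as sample data.
        double[] xs = {0, 10, 10, 0};
        double[] ys = {0, 0, 10, 10};
        System.out.println(contains(xs, ys, 5, 5));   // true
        System.out.println(contains(xs, ys, 15, 5));  // false
    }
}
```

Each test depends only on one point and one polygon, so it needs no shared
state and can be spread across map tasks.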

Incidentally, I have most of the basics of a "MapReduce-Lite" which I
aim to port to use the exact Hadoop API since I am *only* working on
10s-100s of GB of data and find that it runs really well on my
laptop and I don't need the distributed failover.  My goal for that
code is to serve people like me who want to know that they can scale to
terabyte processing but don't need to take the plunge to a full Hadoop
deployment yet, and who want to be able to migrate the processing in the
future as things grow.  It runs on the normal filesystem, and single
node only (e.g. multithreaded), and performs very quickly since it is
just doing Java NIO ByteBuffers in parallel on the underlying
filesystem - on my laptop I Map+Sort+Combine about 130,000 jobs a
second (simplest of simple map operations).  For these small
datasets, you might find it useful - let me know if I should spend
time finishing it (Or submit help?) - it is really very simple.
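
To make the single-node idea concrete, here is a rough sketch of a
multithreaded map + sort + combine pass. It is not the "MapReduce-Lite"
code described above and does not use NIO ByteBuffers or the Hadoop API;
it works on in-memory lines with a plain thread pool. The class name, the
column positions and the "distinct species per cell" job are assumptions
chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LocalMapCombine {
    // Hypothetical column positions in the tab-delimited input.
    static final int CELL_COLUMN = 0;
    static final int SPECIES_COLUMN = 1;

    /** Map + combine one chunk: emit (cell, species) and fold into a sorted map. */
    static SortedMap<String, SortedSet<String>> mapAndCombine(List<String> chunk) {
        SortedMap<String, SortedSet<String>> out = new TreeMap<>();
        for (String line : chunk) {
            String[] cols = line.split("\t", -1);
            if (cols.length <= Math.max(CELL_COLUMN, SPECIES_COLUMN)) {
                continue; // skip malformed records
            }
            out.computeIfAbsent(cols[CELL_COLUMN], k -> new TreeSet<>())
               .add(cols[SPECIES_COLUMN]);
        }
        return out;
    }

    /** Split the input into chunks, process them in parallel, then merge by key. */
    static SortedMap<String, SortedSet<String>> run(List<String> lines, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunkSize = Math.max(1, (lines.size() + threads - 1) / threads);
            List<Future<SortedMap<String, SortedSet<String>>>> futures = new ArrayList<>();
            for (int start = 0; start < lines.size(); start += chunkSize) {
                List<String> chunk =
                        lines.subList(start, Math.min(lines.size(), start + chunkSize));
                Callable<SortedMap<String, SortedSet<String>>> task =
                        () -> mapAndCombine(chunk);
                futures.add(pool.submit(task));
            }
            // Final merge: TreeMap keeps keys sorted, TreeSet keeps values distinct.
            SortedMap<String, SortedSet<String>> merged = new TreeMap<>();
            for (Future<SortedMap<String, SortedSet<String>>> f : futures) {
                for (Map.Entry<String, SortedSet<String>> e : f.get().entrySet()) {
                    merged.computeIfAbsent(e.getKey(), k -> new TreeSet<>())
                          .addAll(e.getValue());
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample records: cell id, then species name.
        List<String> lines = Arrays.asList(
                "cell1\tParus major",
                "cell2\tPasser domesticus",
                "cell1\tParus major",
                "cell1\tPasser domesticus");
        System.out.println(run(lines, 2));
    }
}
```

Combining inside each worker before the final merge keeps the merge step
small, which is the same reason a combiner is used in a Hadoop job.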



On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte <lecompte@gmail.com> wrote:
> Hi Tim,
> Are you mostly just processing/parsing textual log files? How many
> maps/reduces did you configure in your hadoop-ec2-env.sh file? How
> many did you configure in your JobConf? Just trying to get an idea of
> what to expect in terms of performance. I'm noticing that it takes
> about 16 minutes to transfer about 15GB of textual uncompressed data
> from S3 into HDFS after the cluster has started with 15 nodes. I was
> expecting this to take a shorter amount of time, but maybe I'm
> incorrect in my assumptions. I am also noticing that it takes about 15
> minutes to parse through the 15GB of data with a 15 node cluster.
> Thanks,
> Ryan
> On Tue, Sep 2, 2008 at 3:29 AM, tim robertson <timrobertson100@gmail.com> wrote:
>> I have been processing only 100s GBs on EC2, not 1000's and using 20
>> nodes and really only in exploration and testing phase right now.
>> On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock <adpowers@gmail.com> wrote:
>>> Hi Ryan,
>>> Just a heads up, if you require more than the 20 node limit, Amazon
>>> provides a form to request a higher limit:
>>> http://www.amazon.com/gp/html-forms-controller/ec2-request
>>> Andrew
>>> On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte <lecompte@gmail.com> wrote:
>>>> Hello all,
>>>> I'm curious to see how many people are using EC2 to execute their
>>>> Hadoop cluster and map/reduce programs, and how many are using
>>>> home-grown datacenters. It seems like the 20 node limit with EC2 is a
>>>> bit crippling when one wants to process many gigabytes of data. Has
>>>> anyone found this to be the case? How much data are people processing
>>>> with their 20 node limit on EC2? Curious what the thoughts are...
>>>> Thanks,
>>>> Ryan
