hive-user mailing list archives

From Mark Grover <>
Subject Re: Building out Hive in EC2/S3 versus dedicated servers
Date Tue, 22 Nov 2011 20:47:36 GMT
Here is another article that might be insightful for you:

Sam raised some valid points, and going with Amazon is definitely a (relatively) hassle-free
way to get started, especially when one has limited resources for managing an internal
cluster.


----- Original Message -----
From: "Sam Wilson" <>
Sent: Tuesday, November 22, 2011 3:38:01 PM
Subject: Re: Building out Hive in EC2/S3 versus dedicated servers

We recently adopted Hadoop and Hive for doing some significant data processing. We went the
Amazon route. 

My own $.02 is as follows: 

If you are already incredibly experienced with Hadoop and Hive and have someone on staff who
has previously built a cluster at least as big as the one you are projecting to require, then
simply do some back of the envelope calculations and decide if it is cost effective to run
on your own system given all your other business constraints. If you don't know how to do
this, then you aren't sufficiently experienced to go this route. 
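
For illustration only, a back-of-the-envelope comparison of the kind described above might be sketched as below. This is a minimal sketch and not anyone's actual model: the function names, the cost model (hourly on-demand pricing versus amortized hardware plus a fixed monthly operations cost) and every parameter are assumptions, and real pricing, storage, bandwidth and staffing costs would have to be substituted.

```python
# Hypothetical cost model: all names and parameters are illustrative assumptions.

def monthly_cloud_cost(nodes, hourly_rate, hours_per_month):
    """On-demand cost: pay only for the hours the cluster actually runs."""
    return nodes * hourly_rate * hours_per_month


def monthly_owned_cost(nodes, price_per_node, amortization_months, monthly_ops_cost):
    """Owned hardware: purchase price spread over its useful life, plus fixed ops cost."""
    return nodes * price_per_node / amortization_months + monthly_ops_cost


def break_even_hours(nodes, hourly_rate, price_per_node, amortization_months, monthly_ops_cost):
    """Cluster hours per month above which owning becomes cheaper than renting."""
    owned = monthly_owned_cost(nodes, price_per_node, amortization_months, monthly_ops_cost)
    return owned / (nodes * hourly_rate)
```

With made-up inputs (10 nodes, 0.50 per node-hour, 3600 per node amortized over 36 months, 1000 per month for operations), break_even_hours comes out at 400 cluster hours per month. A cluster that runs less than that would be cheaper to rent under these assumptions; one that runs more would be cheaper to own.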

If you are new to Hadoop and Hive, then your best bet is to build your application first,
using EMR as a prototype cluster. If your data is already loaded into S3 or you are already
using Amazon, then this is also a no-brainer way to get started. Hadoop and Hive are not what
I would call user friendly. Frankly, they are full of bugs and gotchas, and are poorly documented.
The learning curve is a bit steep. The most important thing is to prove out your functionality
and build a system that delivers value quickly. You don't want your deadline to pass with
only a pretty rack of servers to show for it. You need functionality. 

EMR lets you focus on your application, your code, your requirements, without having to deal
with the details of the infrastructure. I simply cannot stress enough how nice it has been for us
to be able to spin up new clusters on-the-fly while we were developing our application. Our
ability to rapidly prototype has simply blown me away. 

Once you've got yourself up and running, your application is doing what it's supposed to,
and you've built some familiarity with Hadoop and Hive, my suggestion is to then build a prototype
cluster either hosted or in your office. Familiarize yourself with all the network, OS and
other low-level details. Do some analysis on cost/performance, then decide whether or not
to move your production system from Amazon to somewhere else. 

Everyone's application is unique, so looking at someone else's calculations
is largely pointless. 

How did this pan out for us? We rebuilt a major system component in 3 months, reducing
query times for certain jobs from 16+ days to 4 minutes. We did not purchase a single piece
of hardware, or install a single piece of software we did not write ourselves. We have the
ability to rapidly redeploy our system in any of 5 different data centers around the world
at the flip of a few switches. If we wanted to deploy on our own hardware or in a colo at
this point, we would only have to focus on building the cluster. 

Our app is already built, serving our customers and making us money. 


On Nov 22, 2011, at 3:15 PM, Loren Siebert wrote: 

My colleague has a Heroku-based startup and they are just getting started with Hadoop and
Hive. They’re evaluating running Hive in EC2/S3 versus buying a handful of boxes and installing

One nice (albeit dated) analysis on this question is here, but I’m curious if anyone here
has a different take on it:

What is the sweet spot for when a Hive warehouse in EC2 makes the most sense? 

I’m asking on this Hive list versus the more general Hadoop lists because I think a solution
for a Hive cluster could differ quite a bit from a solution for an HBase cluster. 

- Loren 
