hadoop-general mailing list archives

From Jagane Sundar <jag...@apache.org>
Subject Hadoop as a Big Data app for the cloud
Date Thu, 06 Oct 2011 17:51:29 GMT
Note that I have changed the subject to be more relevant.

> I've just started the wiki page on this topic:
> http://wiki.apache.org/hadoop/Virtual%20Hadoop
I will take a look at this wiki, and hopefully, contribute to it, Steve.

>> There are two aspects to cloud friendliness - deployment
>> technologies/automation, and storage.
> -agility to handle the failure modes of cloud infrastructure

Good point. Amazon EMR starts billing for your EMR job when 90% of the
compute VMs have fired up. They, too, seem to acknowledge the possibility of

> -security in a shared infrastructure

Security is a valid concern for the public cloud. An internal homebrew
OpenStack-based cloud may not need to worry as much about security.
That said, a networking construct such as Amazon VPC goes a long way towards
isolating the Hadoop cluster.

> -flexibility based on demand

Right.

>> As far as deployment automation is concerned, I am eager to know what other
>> approaches you are familiar with. Chef/Puppet et al. are not interesting to
>> me. I want this to have end user self-serve service characteristics, not
>> 'end users file ticket, sysadmin runs [chef|puppet|other] script'.
> done this with a web UI: ask for the #of machines, bring up NN/JT/single DN
> master node, once that is up bring up the workers with a config that
> includes the hostname of the master node.
A person with a database background who wants to use HBase for Big Data
processing will find the whole NN/JT/ZK etc. overwhelming. Much of this
can be hidden. I think there is much work to be done in making Hadoop easier
to use.
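
To make the bring-up order quoted above concrete, here is a minimal sketch in Python. It only simulates the sequence: `Node`, `launch`, and `bring_up_cluster` are hypothetical names, `launch` stands in for whatever cloud API actually starts a VM, and the hostnames and ports are assumptions for illustration. The two property names are the Hadoop 0.20/1.x-era settings for the NameNode and JobTracker addresses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    hostname: str
    roles: List[str]
    config: Dict[str, str] = field(default_factory=dict)
    up: bool = False


def launch(node: Node) -> Node:
    # Hypothetical stand-in for a real cloud API call; it only marks the node as running.
    node.up = True
    return node


def bring_up_cluster(num_machines: int) -> List[Node]:
    """Start the master first, then workers whose config names the master."""
    if num_machines < 2:
        raise ValueError("need one master and at least one worker")

    # Step 1: a single master node running NN, JT and one DN.
    master = launch(Node("master-0", ["namenode", "jobtracker", "datanode"]))
    if not master.up:
        raise RuntimeError("master failed to start; not launching workers")

    # Step 2: workers get a config that includes the master's hostname.
    # Ports 8020 and 8021 are assumed values, not requirements.
    worker_config = {
        "fs.default.name": f"hdfs://{master.hostname}:8020",
        "mapred.job.tracker": f"{master.hostname}:8021",
    }
    workers = [
        launch(Node(f"worker-{i}", ["datanode", "tasktracker"], dict(worker_config)))
        for i in range(1, num_machines)
    ]
    return [master] + workers
```

A self-serve front end would call something like `bring_up_cluster(4)` with the machine count the user asked for, so the user never sees the NN/JT details.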

>> that at least half of Amazon's customers are opting to use Apache Hadoop on
>> EC2 VMs with EBS storage (completely bypassing the EMR offering).
> More expensive, but more flexible in terms of what you can run
It is not clear that EBS is more expensive. If you buy the argument that EBS is
resilient storage, and one HDFS replica is adequate, then EBS comes to
ten cents a GB-month, versus fifteen cents a GB-month for S3.
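
A rough sketch of that arithmetic in Python, using only the two per-GB-month prices quoted above. The 1000 GB data size and the three-replica comparison are assumptions added for illustration, and prices are kept in integer cents to avoid rounding noise.

```python
EBS_CENTS_PER_GB_MONTH = 10  # figure quoted in this message
S3_CENTS_PER_GB_MONTH = 15   # figure quoted in this message


def monthly_cost_cents(data_gb: int, cents_per_gb_month: int, copies: int = 1) -> int:
    """Storage cost per month in cents for `copies` full copies of the data."""
    return data_gb * cents_per_gb_month * copies


data_gb = 1000  # assumed data set size, for illustration only

ebs_one_replica = monthly_cost_cents(data_gb, EBS_CENTS_PER_GB_MONTH, copies=1)
ebs_three_replicas = monthly_cost_cents(data_gb, EBS_CENTS_PER_GB_MONTH, copies=3)
s3 = monthly_cost_cents(data_gb, S3_CENTS_PER_GB_MONTH)

print(ebs_one_replica / 100, ebs_three_replicas / 100, s3 / 100)
```

The comparison depends entirely on the replica count: with one HDFS replica EBS is cheaper than S3 at these prices, while with three replicas it is twice the S3 cost.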

> Summary: I'm not sure that HDFS is the right FS in this world, as it
> contains a lot of assumptions about system stability and HDD persistence
> that aren't valid any more. With the ability to plug in new placers you
> could do tricks like ensure 1 replica lives in a persistent blockstore (and
> rely on it always being there), and add other replicas in transient storage
> if the data is about to be needed in jobs.
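
The placement trick described in the quoted summary could be sketched as follows. This is a toy model in Python, not the HDFS block placement interface: `choose_replica_targets`, its parameters, and the store names are all hypothetical.

```python
from typing import List


def choose_replica_targets(
    persistent_stores: List[str],
    transient_stores: List[str],
    needed_soon: bool,
    extra_replicas: int = 2,
) -> List[str]:
    """Pick storage targets for one block.

    Exactly one replica always goes to persistent storage. Extra replicas
    go to transient storage only when a job is about to read the data.
    """
    if not persistent_stores:
        raise ValueError("need at least one persistent store")

    targets = [persistent_stores[0]]
    if needed_soon:
        # Slicing is safe even if fewer transient stores exist than requested.
        targets.extend(transient_stores[:extra_replicas])
    return targets
```

For example, `choose_replica_targets(["ebs-vol-1"], ["vm-a", "vm-b", "vm-c"], needed_soon=True)` would place one replica on the persistent volume and two on transient VM disks.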

I would be loath to use anything other than an official Apache Hadoop and
its HDFS. My estimate is that various companies are going to pour in about
200 million dollars to develop Apache Hadoop. That kind of money brings in
very, very smart engineers. To benefit from that ecosystem, stick with an
Apache Hadoop from the community. As a counterpoint, witness the quandary
Amazon is in. They are unable to react fast enough to the rise in popularity
of HBase because they chose to go with their own file system alternative to

