hadoop-general mailing list archives

From Jagane Sundar <jag...@apache.org>
Subject Re: Which proposed distro of Hadoop, 0.20.206 or 0.22, will be better for HBase?
Date Thu, 06 Oct 2011 02:00:43 GMT
Thanks for your input, Milind. It's very useful and interesting.

In the interest of brevity, I have truncated most of it except for the point
regarding 'cloud friendly'. I have done some research into this, and want to
get some more community feedback.

>2. Built in support for the cloud. (Whirr is interesting. Ambari more so,
> >but both fall short.)
> Not very sure. If by "support for the cloud" means ability to provision
> atop a hypervisor, adding or removing instances etc, I think there are
> other approaches proven in the industry.
There are two aspects to cloud friendliness - deployment
technologies/automation, and storage.

As far as deployment automation is concerned, I am eager to know what other
approaches you are familiar with. Chef/Puppet et al. are not interesting to
me. I want this to have end-user self-service characteristics, not
'end users file ticket, sysadmin runs [chef|puppet|other] script'.

Storage is very interesting. My own thoughts, from analyzing EC2 and EMR, are
as follows. (A lot of the following is speculation and educated guesswork,
so I may be totally off, but here it is anyway):

Amazon's philosophy is totally 'on-demand bring up when needed and tear down
when done'. I like this philosophy a lot. However, it does not work well for
storage. Storage needs to be always up and available. Hence, they took
Hadoop, stripped off HDFS and built a shim to S3, their object storage
service. There is no POSIX there. MapReduce jobs run in VMs that are
brought up on demand, and access the S3-hosted files using the protocol s3n
(n stands for native; that's native to S3, not native to Hadoop). When this
turned out to be slow as sh**, they seem to have hacked the HDFS layer some
more, in order to actually have a NameNode for metadata, but to use S3 for
storing blocks. They have a protocol s3 to access this. Both of these
approaches have one severe failing: they do not support append and hflush.
Ergo, no HBase on EMR. I am sure they are working furiously to address this
shortcoming and add append/hflush support to s3n or s3, in order to make it
possible to run HBase on EMR. In the meantime, anecdotal evidence suggests
that at least half of Amazon's customers are opting to use Apache Hadoop on
EC2 VMs with EBS storage (completely bypassing the EMR offering). EBS itself
is an interesting storage technology. It is block storage offered over the
Ethernet network, from an occasionally sync'd local disk elsewhere. EBS has
some storage resiliency built in, so the question of how many replicas when
HDFS is built on top of this is very interesting.

This problem of offering a cost-effective Hadoop as an on-demand
self-service offering in the cloud is very interesting. This is a nut I want to

Sorry about the long rant, and again, it is all in the hope that I can evoke
some postings from people who know more about this than I do.

