hbase-user mailing list archives

From Keith Thomas <keith.tho...@gmail.com>
Subject Re: On storing HBase data in AWS S3
Date Wed, 14 Oct 2009 02:58:03 GMT

I am a complete newbie to the wonderful world of Amazon services so I
apologize if I am asking a question that has already been answered.

I am looking for the easiest way to bring up an HBase and Hadoop environment
as the persistence mechanism for a Grails-based web application. I was not
entirely clear which of the myriad of services offered provides the best
approach (EC2, S3, Elastic MapReduce, etc.) until the previous post
pointed me towards EC2 over S3.

Am I correct in understanding that a farm of EC2 instances, with Hadoop and
HBase installed and configured individually by me, is the quickest and
most effective way to proceed with this effort?

Jean-Daniel Cryans-2 wrote:
> Hi users,
> I've recently helped debug a 0.19 HBase setup that was using S3 as
> its DFS (one of the problems is discussed in another thread) and I
> think I've gathered enough information to guide new users on whether
> this is a valuable solution.
> Short answer: don't use it for user-facing apps, consider it for
> elastic EC2 clusters.
> Long answer:
> The main reason why you would want to store your data inside S3 would
> be because of the marketed high availability and infinite scalability.
> As the website says: "It gives any developer access to the same highly
> scalable, reliable, fast, inexpensive data storage infrastructure that
> Amazon uses to run its own global network of web sites. The service
> aims to maximize benefits of scale and to pass those benefits on to
> developers." BTW I don't refute any of this as in my experience this
> has been mostly true.
> HBase can use any filesystem supported in Hadoop, including S3, so it
> seems like a no-brainer to use it instead of having to set up Hadoop.
> Yes indeed, but...
> - You absolutely have to deploy your region servers in EC2 because of
> the obvious latency and bandwidth cost every filesystem access will incur.
> - The way the S3 code works in Hadoop, it writes every inbound and
> outbound file to disk. Apart from slowing every operation down even
> further, if you didn't change hadoop.tmp.dir it will write to /tmp, and
> that volume on EC2 is always very, very small. In fact, the first thing I
> had to debug was a "No space left on device" error, which seems weird
> since S3 should have infinite storage, but the error was actually raised
> when data was written to the tmp folder.
> - There are some unknown interactions because HBase has a very
> different file usage pattern than MapReduce jobs and was optimized for
> HDFS, not distant networked storage.
> So if you need speed, simply don't use S3 with HBase as it will be too
> slow. You can consider using it for elastic MapReduce jobs the same
> way people use it with Hadoop because you don't have to keep all the
> nodes up all the time.
> J-D
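
The "No space left on device" problem described above comes from the local volume behind hadoop.tmp.dir filling up, not from S3 itself. As a minimal sketch, assuming Python is available on the instance, one could check the free space on that volume before starting the region servers. The function name and the 10 GB threshold are hypothetical and are not part of Hadoop or HBase; the directory passed in should be whatever hadoop.tmp.dir is set to in your configuration.

```python
import shutil
import tempfile


def has_enough_scratch_space(tmp_dir, required_bytes):
    """Return True if the volume holding tmp_dir has at least required_bytes free.

    tmp_dir is assumed to be the directory configured as hadoop.tmp.dir;
    required_bytes is a threshold you pick for your own workload.
    """
    usage = shutil.disk_usage(tmp_dir)  # named tuple: total, used, free (bytes)
    return usage.free >= required_bytes


if __name__ == "__main__":
    # The system temp directory stands in for hadoop.tmp.dir here (assumption).
    scratch_dir = tempfile.gettempdir()
    ten_gb = 10 * 1024 ** 3  # hypothetical threshold
    if has_enough_scratch_space(scratch_dir, ten_gb):
        print("Scratch volume looks large enough:", scratch_dir)
    else:
        print("Scratch volume may be too small:", scratch_dir)
```

If the check fails, pointing hadoop.tmp.dir at a larger local volume is the change the quoted message implies.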

View this message in context: http://www.nabble.com/On-storing-HBase-data-in-AWS-S3-tp25794704p25884592.html
Sent from the HBase User mailing list archive at Nabble.com.
