hbase-user mailing list archives

From "Ananth T. Sarathy" <ananth.t.sara...@gmail.com>
Subject Re: On storing HBase data in AWS S3
Date Wed, 07 Oct 2009 21:29:12 GMT
Yeah, first of all, many thanks for your help. We are moving over to
version 0.20.0 now, and we'll let you know how that goes. We've learned a
few things in the process, so we hope to have it ready soon.

It seems to work fairly well inside the cloud. For apps outside the cloud,
we have noticed a lot of configuration weirdness (which I am sure has more
to do with us), including the region servers binding to their internal IP
addresses rather than the external ones, which makes it hard to test and
develop against from outside the cloud.
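
One thing we plan to try for the binding issue (an assumption on our part,
not something we've verified yet) is the DNS-related settings in
hbase-site.xml, which are supposed to control which interface the servers
resolve their address from:

    <!-- hbase-site.xml: a sketch, not verified; the property names come
         from the HBase defaults, and eth0 is an assumed interface name -->
    <property>
      <name>hbase.master.dns.interface</name>
      <value>eth0</value>
    </property>
    <property>
      <name>hbase.regionserver.dns.interface</name>
      <value>eth0</value>
    </property>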

Additionally, a couple of other pointers: you need to download the jets3t
and commons-codec jars into your lib directory to connect to S3, and your
core-site.xml needs all your S3 login info.
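
For reference, the login info portion of core-site.xml looks roughly like
this (a sketch for the s3:// block filesystem; the key values are
placeholders):

    <!-- core-site.xml: S3 credentials for the s3:// filesystem;
         replace the placeholder values with your own AWS keys -->
    <property>
      <name>fs.s3.awsAccessKeyId</name>
      <value>YOUR_AWS_ACCESS_KEY_ID</value>
    </property>
    <property>
      <name>fs.s3.awsSecretAccessKey</name>
      <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
    </property>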


Ananth T Sarathy


On Wed, Oct 7, 2009 at 5:21 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> Hi users,
>
> I've recently helped debug a 0.19 HBase setup that was using S3 as
> its DFS (one of the problems is discussed in another thread), and I
> think I've gathered enough information to guide new users on whether
> this is a worthwhile solution.
>
> Short answer: don't use it for user-facing apps, consider it for
> elastic EC2 clusters.
>
> Long answer:
>
> The main reason you would want to store your data in S3 is its
> marketed high availability and infinite scalability.
> As the website says: "It gives any developer access to the same highly
> scalable, reliable, fast, inexpensive data storage infrastructure that
> Amazon uses to run its own global network of web sites. The service
> aims to maximize benefits of scale and to pass those benefits on to
> developers." BTW, I don't dispute any of this; in my experience it
> has mostly held true.
>
> HBase can use any filesystem supported by Hadoop, including S3, so it
> seems like a no-brainer to use it instead of having to set up Hadoop
> yourself (a sketch of the relevant setting follows at the end of this
> message). Yes indeed, but...
>
> - You absolutely have to deploy your region servers in EC2, because
> of the latency and bandwidth penalty that every filesystem access
> would otherwise incur.
> - The way the S3 code works in Hadoop, it stages every inbound and
> outbound file on local disk. Apart from slowing every operation down
> even more, if you didn't change hadoop.tmp.dir it will write to /tmp,
> and that volume on EC2 is always very small. In fact, the first thing
> I had to debug was a "No space left on device" error, which seems
> weird since S3 should have infinite storage, but the error was really
> raised while data was written to the tmp folder (see the snippet
> after this list).
> - There are some unknown interactions, because HBase has a very
> different file access pattern than MapReduce jobs and was optimized
> for HDFS, not for remote networked storage.
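>
> As a sketch of the hadoop.tmp.dir workaround mentioned above (assuming
> an EC2 image where /mnt is the large instance-local volume), the
> core-site.xml override would look something like:
>
>   <!-- core-site.xml: move Hadoop's local staging area off the tiny
>        /tmp volume; /mnt/hadoop-tmp is an assumed path -->
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/mnt/hadoop-tmp</value>
>   </property>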
>
> So if you need speed, simply don't use S3 with HBase; it will be too
> slow. You can consider using it for elastic MapReduce jobs the same
> way people use S3 with Hadoop, since you don't have to keep all the
> nodes up all the time.
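>
> For anyone who wants to experiment anyway, pointing HBase at S3 comes
> down to the hbase.rootdir setting in hbase-site.xml, roughly like this
> (a sketch; the bucket name is a placeholder):
>
>   <!-- hbase-site.xml: store HBase's root directory on S3 via the
>        s3:// block filesystem; "your-bucket" is a placeholder -->
>   <property>
>     <name>hbase.rootdir</name>
>     <value>s3://your-bucket/hbase</value>
>   </property>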
>
> J-D
>
