Date: Wed, 7 Oct 2009 17:21:02 -0400
Subject: On storing HBase data in AWS S3
From: Jean-Daniel Cryans <jdcryans@gmail.com>
To: hbase-user@hadoop.apache.org

Hi users,

I've recently helped debug a 0.19 HBase setup that was using S3 as its
DFS (one of the problems is discussed in another thread), and I think
I've gathered enough information to guide new users on whether this is
a worthwhile solution.

Short answer: don't use it for user-facing apps; do consider it for
elastic EC2 clusters.

Long answer:

The main reason you would want to store your data in S3 is the marketed
high availability and infinite scalability. As the website says:

"It gives any developer access to the same highly scalable, reliable,
fast, inexpensive data storage infrastructure that Amazon uses to run
its own global network of web sites. The service aims to maximize
benefits of scale and to pass those benefits on to developers."

BTW I don't refute any of this; in my experience it has been mostly
true.

HBase can use any filesystem supported by Hadoop, including S3, so it
seems like a no-brainer to use it instead of having to set up Hadoop.
Yes indeed, but...

- You absolutely have to deploy your region servers in EC2, because of
  the obvious latency and bandwidth cost that every filesystem access
  will incur.

- The way the S3 code works in Hadoop, it buffers every inbound and
  outbound file on local disk.
  Apart from slowing every operation down even more, if you didn't
  change hadoop.tmp.dir it will write to /tmp, and that volume on EC2
  is always very small. In fact, the first thing I had to debug was a
  "No space left on device" error, which seems weird since S3 should
  have infinite storage, but the error was really raised while data
  was being written to the tmp folder.

- There are some unknown interactions, because HBase has a very
  different file usage pattern than MapReduce jobs and was optimized
  for HDFS, not for distant networked storage.

So if you need speed, simply don't use S3 with HBase; it will be too
slow. You can consider using it for elastic MapReduce jobs, the same
way people use it with Hadoop, because you don't have to keep all the
nodes up all the time.

J-D
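For reference, pointing HBase at S3 instead of HDFS comes down to the
hbase.rootdir URI plus AWS credentials for Hadoop's s3 filesystem. A
minimal sketch of the relevant properties ("my-bucket" and the key
values are placeholders, not values from this thread):

```xml
<!-- hbase-site.xml: store HBase's root directory in an S3 bucket.
     "my-bucket" is a placeholder bucket name. -->
<property>
  <name>hbase.rootdir</name>
  <value>s3://my-bucket/hbase</value>
</property>

<!-- core-site.xml (hadoop-site.xml on 0.19): credentials for the
     s3 block filesystem. -->
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```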
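For anyone hitting the "No space left on device" problem described
above, the fix is to point hadoop.tmp.dir at a large local volume so
the S3 filesystem buffers files there instead of in /tmp. A sketch of
the entry; /mnt/hadoop-tmp is a hypothetical path, chosen because EC2
instances typically mount their large ephemeral volume at /mnt:

```xml
<!-- core-site.xml (hadoop-site.xml on 0.19): move Hadoop's local
     buffering off the small /tmp volume. Path is an example only. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hadoop-tmp</value>
</property>
```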