Date: Wed, 7 Oct 2009 17:21:02 -0400
Subject: On storing HBase data in AWS S3
From: Jean-Daniel Cryans <jdcryans@gmail.com>
To: hbase-user@hadoop.apache.org

Hi users,

I've recently helped debug a 0.19 HBase setup that was using S3 as its
DFS (one of the problems is discussed in another thread), and I think
I've gathered enough information to guide new users on whether this is
a worthwhile solution.

Short answer: don't use it for user-facing apps; do consider it for
elastic EC2 clusters.

Long answer:

The main reason you would want to store your data in S3 is the marketed
high availability and infinite scalability. As the website says:

"It gives any developer access to the same highly scalable, reliable,
fast, inexpensive data storage infrastructure that Amazon uses to run
its own global network of web sites. The service aims to maximize
benefits of scale and to pass those benefits on to developers."

BTW I don't refute any of this; in my experience it has been mostly
true.

HBase can use any filesystem supported by Hadoop, including S3, so it
seems like a no-brainer to use it instead of having to set up Hadoop.
Yes indeed, but...

- You absolutely have to deploy your region servers in EC2, because of
  the obvious latency and bandwidth cost that every filesystem access
  will incur.

- The way the S3 code works in Hadoop, it buffers every inbound and
  outbound file on local disk.
  Apart from slowing every operation down even more, if you didn't
  change hadoop.tmp.dir it will write to /tmp, and that volume on EC2
  is always very small. In fact, the first thing I had to debug was a
  "No space left on device" error, which seems weird since S3 should
  have infinite storage, but the error was really raised while data
  was being written to the tmp folder.

- There are some unknown interactions, because HBase has a very
  different file usage pattern than MapReduce jobs and was optimized
  for HDFS, not for distant networked storage.

So if you need speed, simply don't use S3 with HBase; it will be too
slow. You can consider using it for elastic MapReduce jobs, the same
way people use it with Hadoop, because you don't have to keep all the
nodes up all the time.

J-D
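For reference, pointing HBase at S3 instead of HDFS comes down to the
hbase.rootdir URI plus AWS credentials for Hadoop's s3 filesystem. A
minimal sketch of the relevant properties ("my-bucket" and the key
values are placeholders, not values from this thread):

```xml
<!-- hbase-site.xml: store HBase's root directory in an S3 bucket.
     "my-bucket" is a placeholder bucket name. -->
<property>
  <name>hbase.rootdir</name>
  <value>s3://my-bucket/hbase</value>
</property>

<!-- core-site.xml (hadoop-site.xml on 0.19): credentials for the
     s3 block filesystem. -->
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```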
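For anyone hitting the "No space left on device" problem described
above, the fix is to point hadoop.tmp.dir at a large local volume so
the S3 filesystem buffers files there instead of in /tmp. A sketch of
the entry; /mnt/hadoop-tmp is a hypothetical path, chosen because EC2
instances typically mount their large ephemeral volume at /mnt:

```xml
<!-- core-site.xml (hadoop-site.xml on 0.19): move Hadoop's local
     buffering off the small /tmp volume. Path is an example only. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hadoop-tmp</value>
</property>
```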