From: Wilm Schumacher
To: user@hbase.apache.org
Date: Thu, 31 Jul 2014 17:17:52 +0200
Subject: hbase and hadoop (for "normal" hdfs) cluster together?

Hi,

I have a conceptual question and would appreciate hints.

My task is to save files to HDFS, to maintain some information about them in an HBase table, and then serve both to the application. Per file I have around 50 rows with 10 columns (in 2 column families) in the table, with string values of length around 100. The files are of ordinary size (perhaps between a few kB and 100 MB or so). By this estimate the number of files is much smaller than the number of rows (times columns), but the files take up far more disk space than the HBase data. I would further estimate that for every get on a file there will be on the order of hundreds of row gets on HBase.

For the files I want to run a Hadoop cluster (obviously). The question now arises: should I run HBase on the same Hadoop cluster?

The pro of running them together is obvious: I would only have to run one Hadoop cluster, which would save time, money and nerves. On the other hand, it wouldn't be possible to make special adjustments to optimize the cluster for one task or the other. E.g. if I wanted to make HBase more "distributed" by raising the replication factor (to, say, 6), I would have to use double the amount of disk for the "normal" files, too.

So: what should I do? Do you have any comments or hints on this question?

Best wishes,

wilm
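
P.S.: For concreteness, here is roughly the write path I have in mind, using the HBase 0.98 Java client. The table name, family names, row key and paths are just placeholders, and this is only an untested sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FileMetaSketch {
    public static void main(String[] args) throws Exception {
        // Metadata table with the two column families mentioned above
        // ("filemeta", "info" and "tags" are made-up names).
        Configuration hbaseConf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(hbaseConf);
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("filemeta"));
        desc.addFamily(new HColumnDescriptor("info"));
        desc.addFamily(new HColumnDescriptor("tags"));
        admin.createTable(desc);
        admin.close();

        // One of the ~50 rows per file: short string values, ~100 chars each.
        HTable table = new HTable(hbaseConf, "filemeta");
        Put put = new Put(Bytes.toBytes("file-0001#part-01"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("example.bin"));
        put.add(Bytes.toBytes("tags"), Bytes.toBytes("owner"), Bytes.toBytes("wilm"));
        table.put(put);
        table.close();

        // The file itself goes straight into HDFS.
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyFromLocalFile(new Path("/tmp/example.bin"), new Path("/data/file-0001/example.bin"));
        fs.close();
    }
}

Reads would be the reverse: a few hundred gets against the metadata table per file, followed by one streaming read of the file from HDFS.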