From: Wilm Schumacher
To: user@hbase.apache.org
Date: Thu, 31 Jul 2014 17:17:52 +0200
Subject: hbase and hadoop (for "normal" hdfs) cluster together?

Hi,

I have a conceptual question and would appreciate hints.

My task is to save files to HDFS, to maintain some information about them in an HBase table, and then serve both to the application. Per file I have around 50 rows with 10 columns (in 2 column families) in the table, with string values of length around 100. The files are of ordinary size (perhaps between a few kB and 100 MB or so). By this estimate the number of files is much smaller than the number of rows (times columns), but the files take up far more disk space than the HBase data. I would further estimate that for every get on a file there will be on the order of hundreds of row gets on HBase.

For the files I want to run a Hadoop cluster (obviously). The question now arises: should I run HBase on the same Hadoop cluster?

The pro of running them together is obvious: I would only have to run one Hadoop cluster, which would save time, money and nerves. On the other hand, it wouldn't be possible to make special adjustments to optimize the cluster for one task or the other. E.g. if I wanted to make HBase more "distributed" by raising the replication factor (to, say, 6), I would have to use double the amount of disk for the "normal" files, too.

So: what should I do? Do you have any comments or hints on this question?

Best wishes,

wilm
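
P.S.: For concreteness, here is roughly the write path I have in mind, using the HBase 0.98 Java client. The table name, family names, row key and paths are just placeholders, and this is only an untested sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FileMetaSketch {
    public static void main(String[] args) throws Exception {
        // Metadata table with the two column families mentioned above
        // ("filemeta", "info" and "tags" are made-up names).
        Configuration hbaseConf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(hbaseConf);
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("filemeta"));
        desc.addFamily(new HColumnDescriptor("info"));
        desc.addFamily(new HColumnDescriptor("tags"));
        admin.createTable(desc);
        admin.close();

        // One of the ~50 rows per file: short string values, ~100 chars each.
        HTable table = new HTable(hbaseConf, "filemeta");
        Put put = new Put(Bytes.toBytes("file-0001#part-01"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("example.bin"));
        put.add(Bytes.toBytes("tags"), Bytes.toBytes("owner"), Bytes.toBytes("wilm"));
        table.put(put);
        table.close();

        // The file itself goes straight into HDFS.
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyFromLocalFile(new Path("/tmp/example.bin"), new Path("/data/file-0001/example.bin"));
        fs.close();
    }
}

Reads would be the reverse: a few hundred gets against the metadata table per file, followed by one streaming read of the file from HDFS.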