Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (athena.apache.org: domain of cgandevia@gmail.com
 designates 74.125.82.49 as permitted sender)
MIME-Version: 1.0
From: Cameron Gandevia <cgandevia@gmail.com>
Date: Fri, 26 Apr 2013 12:49:01 -0700
Message-ID: 
 <CAAYV3ta9BqyD=ZwJM8HM8H_+N7=h1w642c2HCXOSOCC9KvgACg@mail.gmail.com>
Subject: Schema Design Question
To: user@hbase.apache.org
Content-Type: multipart/alternative; boundary=f46d043bdce6059f9704db48d63c

--f46d043bdce6059f9704db48d63c
Content-Type: text/plain; charset=UTF-8

Hi,

I am new to HBase. I have been building a proof of concept (POC) for an
application and have a design question.

Currently we have a single table with the following key design:

jobId_batchId_bundleId_uniquefileId

This is an offline processing system, so data would be bulk loaded into
HBase via map/reduce jobs. We only need to support report-generation
queries that run map/reduce over a batch (possibly with a single column
filter), with the batchId as the start/end scan key. Once we have finished
processing a job, we are free to remove the data from HBase.
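
To make the batch scan concrete, here is a minimal sketch in plain Python
(no HBase client involved) of how the start/stop row keys for one batch
could be derived from the key design above. The helper names, the sample
IDs, and the assumption that IDs never contain "_" are illustrative only.
Because jobId leads the key, the scan prefix here is jobId_batchId_.

```python
def make_row_key(job_id, batch_id, bundle_id, file_id):
    """Build a row key following jobId_batchId_bundleId_uniquefileId."""
    return "_".join([job_id, batch_id, bundle_id, file_id]).encode("utf-8")


def batch_prefix(job_id, batch_id):
    """Prefix shared by every row of one batch.

    The trailing "_" keeps batch "7" from also matching batch "70".
    """
    return f"{job_id}_{batch_id}_".encode("utf-8")


def prefix_scan_range(prefix):
    """Return (start_row, stop_row) covering all keys that begin with prefix.

    The stop row is treated as exclusive. It is built by dropping trailing
    0xFF bytes and incrementing the last remaining byte. An empty stop row
    stands for "scan to the end of the table".
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return prefix, b""
    stop = trimmed[:-1] + bytes([trimmed[-1] + 1])
    return prefix, stop


# Hypothetical IDs, for illustration only.
start_row, stop_row = prefix_scan_range(batch_prefix("job42", "batch7"))
key_in_batch = make_row_key("job42", "batch7", "bundle1", "file9")
key_other_batch = make_row_key("job42", "batch70", "bundle1", "file9")
```

Any key with start_row <= key < stop_row belongs to the batch, so these two
values are what would be handed to the scan as its start and stop rows.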

We have varied workloads, so a job could be made up of 10 rows, 100,000
rows, or 1 billion rows, with the average falling somewhere around 10
million rows.

My question is related to pre-splitting. If we have a billion rows all with
the same batchId (our map/reduce scan key), my understanding is that we
should pre-split the table to create buckets hosted by different regions.
If a job's workload can be so varied, would it make sense to have a single
table containing all jobs? Or should we create one table per job and
pre-split that table for the given workload? If we had separate tables, we
could drop them when no longer needed.
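
As a rough illustration of the per-job pre-split option, the sketch below
(plain Python, no HBase client) picks a region count from the expected row
count and generates evenly spaced split keys inside one batch. Everything
in it is an assumption made for the example: the rows_per_region target of
50 million, the idea that bundleId is a zero-padded decimal number of fixed
width, and the function names themselves.

```python
import math


def region_count(expected_rows, rows_per_region):
    """Number of regions to pre-create for a job of the given size."""
    return max(1, math.ceil(expected_rows / rows_per_region))


def split_keys(batch_prefix, num_bundles, regions, width=8):
    """Return regions - 1 split keys spread evenly over the bundleId range.

    Assumes bundleId is a zero-padded decimal in [0, num_bundles), so that
    the byte order of the keys matches the numeric order of the bundles.
    """
    regions = min(regions, num_bundles)  # cannot split finer than one bundle
    keys = []
    for i in range(1, regions):
        boundary = i * num_bundles // regions
        keys.append(batch_prefix + str(boundary).zfill(width).encode("ascii"))
    return keys


# Hypothetical sizing target: about 50 million rows per region.
small_job = region_count(10, 50_000_000)
large_job = region_count(1_000_000_000, 50_000_000)
example_splits = split_keys(b"job42_batch7_", 1000, 4, width=4)
```

With these numbers a 10-row job gets a single region and a billion-row job
gets 20, which is the kind of per-job sizing a separate table would allow.
The resulting list is the sort of value that could be passed as the split
keys when the table is created.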

If we didn't have a separate table per job, how should we perform splitting?
Should we choose our largest possible workload and split for that, even
though 90% of our jobs would fall near the lower bound in terms of row
count? Would we experience any issues purging jobs of varying sizes if
everything were in a single table?

Any advice would be greatly appreciated.

Thanks

--f46d043bdce6059f9704db48d63c--