hbase-user mailing list archives

From: Michael Segel <michael_se...@hotmail.com>
Subject: Question about table size limitations...
Date: Fri, 15 Apr 2011 20:01:29 GMT

I have a question about whether there are practical size
limitations on a table in HBase and, if so, what they are.

I was asked ‘how many rows can one reasonably expect HBase
to handle…’ and the person asking didn’t like my “It depends…”
answer. (I’m a consultant, and the answer to every problem has to start with an “It
depends…” caveat. :-)

In trying to arrive at a practical answer, I’ve created this
hypothetical problem, and hopefully someone with a bit more insight and
knowledge can provide a better answer.
Please note that this is a hypothetical example and any resemblance to a
real-life problem is a coincidence.

We have a fleet of petroleum exploration vessels. Each
vessel tows a set of sonar buoys to take measurements of the ocean’s floor. Searches
can overlap (crisscross patterns).

So our data sets have both a geospatial aspect and
a time series aspect. The complete data set for a single ocean can be large,
measured in tens of PBs.

There are two known use cases:

For a given ‘sweep’, process that data set.
(A sweep is the data set for a given ship in a given grid space for a single day, where we know
the start and end times of the sweep.)

For a given grid_id (geospatial box), process
all of the data collected by all of the sweeps that occurred there. (Different ships,
dates, etc …)

Having said all of that… how much data can we store in a
table? How many rows?

Assume that the data set per time interval per buoy is 1K in
size and that there are going to be billions of these data points in the
database. (And we can store each buoy’s result in a different column of the

What I’d like to have is some sort of formula that we can
use to help determine a realistic size limit before performance falls

There’s more to this, but the idea is to explore HBase’s
capabilities and limitations. We need to know this because we'd like to anticipate any problems
and design to avoid them, without having to buy and build a 2000-node cluster
just to test this solution...
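
To make the arithmetic concrete, here is a rough Python sketch of the kind of back-of-envelope estimate I have in mind, using the 1K-per-data-point figure above. The replication factor and the region size are assumptions for illustration only (3 is the HDFS default; the 1 GB region size is just a placeholder), the function names are made up, and the estimate ignores the per-cell key overhead HBase adds as well as any compression.

```python
# Back-of-envelope sizing sketch. Every constant and name here is an
# illustrative assumption, not a measured or recommended value.
KB = 1024
GB = 1024 ** 3
PB = 1024 ** 5


def estimate(num_points, bytes_per_point=1 * KB, replication=3,
             region_bytes=1 * GB):
    """Return (raw_bytes, on_disk_bytes, region_count) for num_points values.

    raw_bytes     : payload only, no key overhead, no compression
    on_disk_bytes : raw_bytes times the assumed HDFS replication factor
    region_count  : raw_bytes divided by the assumed region size, rounded up
    """
    raw = num_points * bytes_per_point
    on_disk = raw * replication
    regions = -(-raw // region_bytes)  # ceiling division on integers
    return raw, on_disk, regions


def points_for(raw_bytes, bytes_per_point=1 * KB):
    """How many data points of the given size fit in raw_bytes of payload."""
    return raw_bytes // bytes_per_point


if __name__ == "__main__":
    raw, on_disk, regions = estimate(10 ** 9)  # one billion data points
    print(f"1e9 points: {raw / GB:.0f} GB raw, "
          f"{on_disk / GB:.0f} GB replicated, {regions} regions")
    print(f"10 PB of payload holds {points_for(10 * PB):,} points")
```

Under these assumptions a billion 1K data points is only about a terabyte of payload, while 10 PB of payload works out to roughly eleven trillion data points, so the row counts in question may be well beyond "billions".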



PS. JDCryans, does this help explain the problem?
