hbase-user mailing list archives

From Friso van Vollenhoven <fvanvollenho...@xebia.com>
Subject Re: Realtime Query with hbase (aggregation)
Date Thu, 30 Dec 2010 12:59:15 GMT
If you need to do aggregates over substantial volumes of data, it will always take time. Scans
for sequential rows are typically IO bound, so faster IO could probably make you go a bit
faster, but on AWS I guess that's difficult.

If you really want to speed things up, you need to find a way to reduce the data volume without
losing information. This usually requires preprocessing of some kind. You could store a pre-calculated
average or count+sum pair per row in a separate family or table. This way, you'd only have
to scan the averages instead of all the data. Getting sub-second answers to questions that
require a lot of source data usually means storing derived data sets redundantly.
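The count+sum idea can be sketched in plain Java (the class and method names below are mine, purely to illustrate; this is not an HBase API):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of count+sum pre-aggregation: keep a running
// (count, sum) per year/doy so a regional average reduces to reading
// small pre-aggregates instead of scanning every raw data point.
public class PreAggregate {
    static final class CountSum {
        long count;
        double sum;
        void add(double v) { count++; sum += v; }
        double average() { return sum / count; }
    }

    private final Map<String, CountSum> byDay = new HashMap<>();

    // Called once per raw data point when the derived set is (re)built.
    public void add(String yeardoy, double value) {
        byDay.computeIfAbsent(yeardoy, k -> new CountSum()).add(value);
    }

    // Fast query path: just divide the stored sum by the stored count.
    public double average(String yeardoy) {
        return byDay.get(yeardoy).average();
    }
}
```

Each (count, sum) pair would then be serialized into its own cell in a separate family or table, so answering an average query only touches those small cells.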

If you never or rarely have missing data points, you could also store a series of data points
in one value. So let the column qualifier just be a year or month and then store a list of
366 data points in the value. This would allow you to go through it more quickly. You'd just
need a magic value for missing data points, such that you can take care of those in software.
We do something similar using protobuf as storage format.
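A fixed-width variant of that packing could look like the following; the 8-byte encoding and the -9999 sentinel are illustrative assumptions (as said, we actually use protobuf):

```java
import java.nio.ByteBuffer;

// Sketch: store up to 366 daily readings as one cell value, with a
// magic sentinel (an assumption; pick a value outside your data's
// range) marking missing days so they can be skipped in software.
public class PackedYear {
    static final double MISSING = -9999.0;
    static final int DAYS = 366;

    // Encode into one fixed-size byte array (the would-be cell value).
    public static byte[] encode(double[] days) {
        ByteBuffer buf = ByteBuffer.allocate(DAYS * 8);
        for (int i = 0; i < DAYS; i++) {
            buf.putDouble(i < days.length ? days[i] : MISSING);
        }
        return buf.array();
    }

    // One sequential pass over the value; missing days are skipped.
    public static double average(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        double sum = 0;
        int count = 0;
        for (int i = 0; i < DAYS; i++) {
            double v = buf.getDouble();
            if (v != MISSING) {
                sum += v;
                count++;
            }
        }
        return count == 0 ? MISSING : sum / count;
    }
}
```

This trades 366 cells (each carrying key, family, qualifier and timestamp overhead) for one value that can be read and averaged in a single pass.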

Point is that calculating averages is not computationally heavy, and you're scanning through
the data sequentially, so the main bottleneck will be the amount of data that needs to come
off disk; there aren't many seeks happening and the CPU is not working very hard.

If the usage pattern requires the same data points to be queried often (if there is a hot
spot in the data), you could make things snappier by throwing more block cache at it.
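For completeness, the block cache share is a single region server setting in hbase-site.xml (a fraction of the heap; 0.4 below is only an example value, not a recommendation):

```xml
<!-- hbase-site.xml: fraction of the region server heap (-Xmx)
     reserved for the HFile block cache -->
<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value>
</property>
```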

Hope this helps,

On 28 dec 2010, at 20:11, Gangl, Michael E (388K) wrote:

> I'm trying to do some realtime queries in HBase over about 550 million
> points of data (~150K rows, ~3700 columns/row), and need some help/ideas on
> speeding it up.
> I have a dataset in HBase that consists of the following:
> Row: a geohashed lat/lon value (there are 150k unique row keys)
> Column family 1: science data. There are a bunch of columns in this family,
> notably a year/day-of-year (doy) pair from 1999260 to 2009366 (so about 3700
> columns per family)
> It looks like this:
> Key          |1999266|1999267|1999268|  ...  |2009365 | 2009366
> -------------|-------|-------|-------|-------|--------|----------|
> ljkashd781aj |  -14  | -16   | -17   | ..... | -17.755| -17.5    |
> ljkavhd79ads |  -14  | -17   | -17   | ..... | -16.785| -16.665  |
> ljkcshd84q2d |  -15  | -16   | -17.35| ..... | -15.445| -12.34   |
> So if I wanted to look at the all-time data for a single point (let's say
> lat/lon 75,-135) I'd geohash that point and do a get on it (we're assuming
> the point exists).
> I'd get back the science column family, which has ~3700 points (10 years *
> 365 days), and can plot this or do whatever I need just fine.
> The issue happens when I want to get the data over a larger region. I can
> select the region simply enough: I do a scan from the start point to
> the end point (a bounding box of lat/lon values, where geohashes inside
> that box are in lexicographic order; this is all straightforward for me).
> The issue I have is aggregating the results and returning the average result
> for a year/doy pair.
> Currently I set up a scanner to go through each row, and for each key I
> parse the yeardoy/value pair and add it to a hash map: the yeardoy is the
> key, and the value is an object with sum and count fields (the count is
> incremented each time I add to the sum, no more than +1 per row). This lets
> me sum up the values and find the average by dividing the sum by the count,
> giving the average value for a certain year/doy. But this takes a long time
> for larger regions (>1 minute, often 2-3 minutes), and I'm trying to figure
> out the best way to do this.
> I currently have a 10-node HBase setup (c1.xlarge on AWS), with LZO
> compression and some caching of scan results (I've played with this but
> don't think it's speeding me up or slowing me down much). I'm wondering
> if I should add more machines, or if there is a better way to get the
> results I'm looking for.
> I'm trying to do the aggregation in realtime, but understand I can submit
> jobs and return the results to a user at a later time if need be.
> Sorry for the long email, just want to explain the problem clearly.
> -Mike
