From: Friso van Vollenhoven
To: user@hbase.apache.org
Subject: Re: Realtime Query with hbase (aggregation)
Date: Thu, 30 Dec 2010 12:59:15 +0000

If you need to do aggregates over substantial volumes of data, it will always take time. Scans over sequential rows are typically IO bound, so faster IO could probably make you go a bit faster, but on AWS I guess that's difficult.

If you really want to speed things up, you need to find a way to reduce the data volume without losing information. This usually requires preprocessing of some kind. You could store a pre-calculated average or a count+sum pair per row in a separate column family or table. That way, you'd only have to scan the averages instead of all of the data. Getting sub-second answers to questions that require a lot of source data usually means storing derived data sets redundantly.

If you never or rarely have missing data points, you could also store a series of data points in one value. So let the column qualifier just be a year or month and then store a list of 366 data points in the value. This would allow you to go through it more quickly. You'd just need a magic value for missing data points, so that you can take care of those in software. We do something similar using protobuf as the storage format.

Point is that calculating averages is not computationally heavy, and you're scanning through the data sequentially, so the main bottleneck will be the amount of data that needs to come off disk; there are not a lot of seeks happening and the CPU is not working very hard.
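As a rough sketch of the first idea (against the classic HBase Java client; the table name "geodata", the family "agg" and the qualifiers "sum"/"count" are made-up names, and you would keep them up to date at write time), the read side becomes a scan over just the derived family:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionAverage {
  // Average over a geohash range by scanning only a small, pre-aggregated
  // family instead of the full science data. Table name "geodata", family
  // "agg" and qualifiers "sum"/"count" are made-up names for illustration.
  public static double average(String startGeohash, String stopGeohash) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "geodata");
    Scan scan = new Scan(Bytes.toBytes(startGeohash), Bytes.toBytes(stopGeohash));
    scan.addFamily(Bytes.toBytes("agg"));     // skip the wide science family entirely
    scan.setCaching(1000);
    ResultScanner scanner = table.getScanner(scan);
    double sum = 0;
    long count = 0;
    try {
      for (Result r : scanner) {
        byte[] s = r.getValue(Bytes.toBytes("agg"), Bytes.toBytes("sum"));
        byte[] c = r.getValue(Bytes.toBytes("agg"), Bytes.toBytes("count"));
        if (s == null || c == null) continue; // no pre-aggregated data for this row
        sum += Bytes.toDouble(s);
        count += Bytes.toLong(c);
      }
    } finally {
      scanner.close();
    }
    return count == 0 ? 0 : sum / count;
  }
}

The same trick works per year/doy if you keep a sum and count per qualifier instead of per row; the point is that the regional query only ever touches derived data.

And a minimal fragment of the packed-values idea, using NaN as the magic marker for missing days (the family "packed", the raw-double encoding and the table/geohashRowKey variables are assumptions; as said, we actually use protobuf for this):

// Needs java.nio.ByteBuffer, java.util.Arrays plus the Put/Bytes classes,
// and an HTable named "table" as in the sketch above.
double[] days = new double[366];
Arrays.fill(days, Double.NaN);                    // everything missing by default
// ... fill in the days that actually have measurements ...
ByteBuffer buf = ByteBuffer.allocate(366 * 8);
for (double v : days) buf.putDouble(v);
Put put = new Put(Bytes.toBytes(geohashRowKey));  // geohashRowKey is hypothetical
put.add(Bytes.toBytes("packed"), Bytes.toBytes("2009"), buf.array());
table.put(put);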
If the usage pattern requires the same data points to be queried often (if there is a hot spot in the data), you could make things snappier by throwing more block cache at it.

Hope this helps,
Friso


On 28 dec 2010, at 20:11, Gangl, Michael E (388K) wrote:

> I'm trying to do some realtime queries in HBase over about 550 million
> points of data (~150K rows, ~3700 columns/row), and need some help/ideas on
> speeding it up.
>
> I have a dataset in HBase that consists of the following:
>
> Row: a geohashed lat/lon value (there are 150K unique row keys).
>
> Column family 1: science data. There are a bunch of columns in this family,
> notably a year/day-of-year (doy) pair from 1999260 to 2009366 (so about 3700
> columns per family).
>
> It looks like this:
>
> Key          |1999266|1999267|1999268| ..... |2009365 |2009366 |
> -------------|-------|-------|-------|-------|--------|--------|
> ljkashd781aj | -14   | -16   | -17   | ..... | -17.755| -17.5  |
> ljkavhd79ads | -14   | -17   | -17   | ..... | -16.785| -16.665|
> ljkcshd84q2d | -15   | -16   | -17.35| ..... | -15.445| -12.34 |
>
> So if I wanted to look up the all-time data for a single point (let's say
> lat/lon 75,-135), I'd geohash that point and do a get on it (we're assuming
> the point exists).
>
> I'd get back the science column family, which has ~3700 points (10 years *
> 365 days), and can plot this or do whatever I need just fine.
>
> The issue happens when I want to get the data over a larger region. I can
> select the region simply enough: I do a scan from the start point to the
> end point (a bounding box of lat/lon values, where geohashes inside that
> box are in lexicographic order; this is all straightforward for me). The
> issue I have is aggregating the results and returning the average result
> for a year/doy pair.
>
> Currently I set up a scanner to go through each row, and for each key I
> parse the yeardoy/value and add this to a hash map. The yeardoy is the key
> into it, and then I have an object with sum and count fields (the count is
> incremented each time I add to the sum, no more than +1 per row). This lets
> me sum up the values and find the average by dividing the sum by the count,
> giving the average value for a certain year/doy. But this takes a long time
> for larger regions (>1 minute, often 2-3 minutes), and I'm trying to figure
> out the best way to do this.
>
> I currently have a 10-node HBase setup (c1.xlarge on AWS), with LZO
> compression, and I'm caching the scan results to a certain extent (I've
> played with this but don't think it's speeding me up or slowing me down too
> much). I'm wondering if I should add more machines, or if there is a better
> way to get the results I'm looking for.
>
> I'm trying to do the aggregation in realtime, but understand I can submit
> jobs and return the results to a user at a later time if need be.
>
> Sorry for the long email, just want to explain the problem clearly.
>
> -Mike
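
For reference, the scan-and-aggregate loop Mike describes maps onto the classic HBase Java client roughly like this (a fragment: the table, startGeohash and stopGeohash variables, the family name "science", and the assumption that values are stored as doubles are all illustrative):

// Needs java.util.HashMap/Map/NavigableMap, the HBase client classes
// (Scan, ResultScanner, Result) and org.apache.hadoop.hbase.util.Bytes.
Scan scan = new Scan(Bytes.toBytes(startGeohash), Bytes.toBytes(stopGeohash));
scan.addFamily(Bytes.toBytes("science"));
scan.setCaching(500);
ResultScanner scanner = table.getScanner(scan);
Map<String, double[]> agg = new HashMap<String, double[]>(); // yeardoy -> {sum, count}
try {
  for (Result r : scanner) {
    NavigableMap<byte[], byte[]> cols = r.getFamilyMap(Bytes.toBytes("science"));
    for (Map.Entry<byte[], byte[]> e : cols.entrySet()) {
      String yearDoy = Bytes.toString(e.getKey());  // qualifier, e.g. "2009365"
      double value = Bytes.toDouble(e.getValue());  // assumes values stored as doubles
      double[] sc = agg.get(yearDoy);
      if (sc == null) { sc = new double[2]; agg.put(yearDoy, sc); }
      sc[0] += value;   // running sum
      sc[1] += 1;       // count of contributing rows
    }
  }
} finally {
  scanner.close();
}
// average for a given yeardoy: agg.get(yearDoy)[0] / agg.get(yearDoy)[1]

The whole dataset for the region has to come off disk and through this loop, which is why the pre-aggregated or packed layouts above shrink the work so much.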