From: Friso van Vollenhoven
To: user@hbase.apache.org
Subject: Re: Realtime Query with hbase (aggregation)
Date: Thu, 30 Dec 2010 12:59:15 +0000

If you need to do aggregates over substantial volumes of data, it will always take time. Scans over sequential rows are typically IO bound, so faster IO could probably make you go a bit faster, but on AWS I guess that's difficult.

If you really want to speed things up, you need to find a way to reduce the data volume without losing information. This usually requires preprocessing of some kind. You could store a pre-calculated average or a count+sum pair per row in a separate column family or table. That way, you'd only have to scan the averages instead of all of the data. Getting sub-second answers to questions that require a lot of source data usually means storing derived data sets redundantly.

If you never or rarely have missing data points, you could also store a series of data points in one value. So let the column qualifier just be a year or month and then store a list of 366 data points in the value. This would allow you to go through it more quickly. You'd just need a magic value for missing data points, so that you can take care of those in software. We do something similar using protobuf as the storage format.

Point is that calculating averages is not computationally heavy, and you're scanning through the data sequentially, so the main bottleneck will be the amount of data that needs to come off disk; there are not a lot of seeks happening and the CPU is not working very hard.
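As a rough sketch of the first idea (against the classic HBase Java client; the table name "geodata", the family "agg" and the qualifiers "sum"/"count" are made-up names, and you would keep them up to date at write time), the read side becomes a scan over just the derived family:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionAverage {
  // Average over a geohash range by scanning only a small, pre-aggregated
  // family instead of the full science data. Table name "geodata", family
  // "agg" and qualifiers "sum"/"count" are made-up names for illustration.
  public static double average(String startGeohash, String stopGeohash) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "geodata");
    Scan scan = new Scan(Bytes.toBytes(startGeohash), Bytes.toBytes(stopGeohash));
    scan.addFamily(Bytes.toBytes("agg"));     // skip the wide science family entirely
    scan.setCaching(1000);
    ResultScanner scanner = table.getScanner(scan);
    double sum = 0;
    long count = 0;
    try {
      for (Result r : scanner) {
        byte[] s = r.getValue(Bytes.toBytes("agg"), Bytes.toBytes("sum"));
        byte[] c = r.getValue(Bytes.toBytes("agg"), Bytes.toBytes("count"));
        if (s == null || c == null) continue; // no pre-aggregated data for this row
        sum += Bytes.toDouble(s);
        count += Bytes.toLong(c);
      }
    } finally {
      scanner.close();
    }
    return count == 0 ? 0 : sum / count;
  }
}

The same trick works per year/doy if you keep a sum and count per qualifier instead of per row; the point is that the regional query only ever touches derived data.

And a minimal fragment of the packed-values idea, using NaN as the magic marker for missing days (the family "packed", the raw-double encoding and the table/geohashRowKey variables are assumptions; as said, we actually use protobuf for this):

// Needs java.nio.ByteBuffer, java.util.Arrays plus the Put/Bytes classes,
// and an HTable named "table" as in the sketch above.
double[] days = new double[366];
Arrays.fill(days, Double.NaN);                    // everything missing by default
// ... fill in the days that actually have measurements ...
ByteBuffer buf = ByteBuffer.allocate(366 * 8);
for (double v : days) buf.putDouble(v);
Put put = new Put(Bytes.toBytes(geohashRowKey));  // geohashRowKey is hypothetical
put.add(Bytes.toBytes("packed"), Bytes.toBytes("2009"), buf.array());
table.put(put);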
If the usage pattern requires the same data points to be queried often (if there is a hot spot in the data), you could make things snappier by throwing more block cache at it.

Hope this helps,
Friso


On 28 dec 2010, at 20:11, Gangl, Michael E (388K) wrote:

> I'm trying to do some realtime queries in HBase over about 550 million
> points of data (~150K rows, ~3700 columns/row), and need some help/ideas on
> speeding it up.
>
> I have a dataset in HBase that consists of the following:
>
> Row: a geohashed lat/lon value (there are 150K unique row keys).
>
> Column family 1: science data. There are a bunch of columns in this family,
> notably a year/day-of-year (doy) pair from 1999260 to 2009366 (so about 3700
> columns per family).
>
> It looks like this:
>
> Key          |1999266|1999267|1999268| ..... |2009365 |2009366 |
> -------------|-------|-------|-------|-------|--------|--------|
> ljkashd781aj | -14   | -16   | -17   | ..... | -17.755| -17.5  |
> ljkavhd79ads | -14   | -17   | -17   | ..... | -16.785| -16.665|
> ljkcshd84q2d | -15   | -16   | -17.35| ..... | -15.445| -12.34 |
>
> So if I wanted to look up the all-time data for a single point (let's say
> lat/lon 75,-135), I'd geohash that point and do a get on it (we're assuming
> the point exists).
>
> I'd get back the science column family, which has ~3700 points (10 years *
> 365 days), and can plot this or do whatever I need just fine.
>
> The issue happens when I want to get the data over a larger region. I can
> select the region simply enough: I do a scan from the start point to the
> end point (a bounding box of lat/lon values, where geohashes inside that
> box are in lexicographic order; this is all straightforward for me). The
> issue I have is aggregating the results and returning the average result
> for a year/doy pair.
>
> Currently I set up a scanner to go through each row, and for each key I
> parse the yeardoy/value and add this to a hash map. The yeardoy is the key
> into it, and then I have an object with sum and count fields (the count is
> incremented each time I add to the sum, no more than +1 per row). This lets
> me sum up the values and find the average by dividing the sum by the count,
> giving the average value for a certain year/doy. But this takes a long time
> for larger regions (>1 minute, often 2-3 minutes), and I'm trying to figure
> out the best way to do this.
>
> I currently have a 10-node HBase setup (c1.xlarge on AWS), with LZO
> compression, and I'm caching the scan results to a certain extent (I've
> played with this but don't think it's speeding me up or slowing me down too
> much). I'm wondering if I should add more machines, or if there is a better
> way to get the results I'm looking for.
>
> I'm trying to do the aggregation in realtime, but understand I can submit
> jobs and return the results to a user at a later time if need be.
>
> Sorry for the long email, just want to explain the problem clearly.
>
> -Mike
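
For reference, the scan-and-aggregate loop Mike describes maps onto the classic HBase Java client roughly like this (a fragment: the table, startGeohash and stopGeohash variables, the family name "science", and the assumption that values are stored as doubles are all illustrative):

// Needs java.util.HashMap/Map/NavigableMap, the HBase client classes
// (Scan, ResultScanner, Result) and org.apache.hadoop.hbase.util.Bytes.
Scan scan = new Scan(Bytes.toBytes(startGeohash), Bytes.toBytes(stopGeohash));
scan.addFamily(Bytes.toBytes("science"));
scan.setCaching(500);
ResultScanner scanner = table.getScanner(scan);
Map<String, double[]> agg = new HashMap<String, double[]>(); // yeardoy -> {sum, count}
try {
  for (Result r : scanner) {
    NavigableMap<byte[], byte[]> cols = r.getFamilyMap(Bytes.toBytes("science"));
    for (Map.Entry<byte[], byte[]> e : cols.entrySet()) {
      String yearDoy = Bytes.toString(e.getKey());  // qualifier, e.g. "2009365"
      double value = Bytes.toDouble(e.getValue());  // assumes values stored as doubles
      double[] sc = agg.get(yearDoy);
      if (sc == null) { sc = new double[2]; agg.put(yearDoy, sc); }
      sc[0] += value;   // running sum
      sc[1] += 1;       // count of contributing rows
    }
  }
} finally {
  scanner.close();
}
// average for a given yeardoy: agg.get(yearDoy)[0] / agg.get(yearDoy)[1]

The whole dataset for the region has to come off disk and through this loop, which is why the pre-aggregated or packed layouts above shrink the work so much.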