hbase-user mailing list archives

From Oliver Meyn <oliver.m...@zerofootprint.net>
Subject performance using versions as dimension
Date Wed, 19 May 2010 20:52:52 GMT
Hi All,

I'm new to hbase and columnar storage schemas, so any comments you  
have on the schema or the actual problem at hand are very much  
welcome.  I'm using 0.20.4, initially testing as standalone on my  
development laptop (OS X), all settings default except for data  
directory, and accessing hbase through the Java api.

In my initial testing I have 50 Gateways, each of which is
responsible for 100 unique Devices, each of which reports its power
usage every second.  So that's 5000 unique Devices in total.  Here are
the queries I'd like to answer:

1) What is the current power consumption of Device X?
2) What is the average power consumption of Device X between Date 1 and
Date 2?
3) What is the current power consumption at Gateway Y?
4) What is the average power consumption at Gateway Y between Date 1 and
Date 2?

I'm imagining this as two tables - "devices" and "gateways".  The  
devices table has a column family called "device_samples" which only  
has one column "power" and 5000 rows (one for each device).  Every new  
sample gets written to the power column of its device at the timestamp  
from the original sample sent by the Device.  Now I can answer query 1  
with a simple get, and I can answer query 2 using the api setTimeRange  
call on another simple get (and do my own math to average the  
results).  This works great so far - with 50k versions in each cell  
query 1 is less than 50ms, and query 2 is only marginally more (on my  
dev machine, remember).
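
To make the access pattern for queries 1 and 2 concrete, here is a minimal sketch of the client-side logic. It does not call HBase: the class DevicePowerCell and its methods are hypothetical, and a TreeMap stands in for the versions stored in one device's "power" cell. The half-open range [minStamp, maxStamp) is meant to mirror how Get.setTimeRange treats its bounds. Against a real table, the get for query 2 would also need setMaxVersions() so that more than the newest version comes back.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for one device's "power" cell:
// timestamp (ms) -> power reading in watts. This is NOT the HBase client API.
class DevicePowerCell {
    private final NavigableMap<Long, Double> versions = new TreeMap<>();

    // Store a sample at the timestamp reported by the device.
    void put(long timestampMs, double watts) {
        versions.put(timestampMs, watts);
    }

    // Query 1: the newest version in the cell.
    double current() {
        if (versions.isEmpty()) {
            throw new IllegalStateException("no samples");
        }
        return versions.lastEntry().getValue();
    }

    // Query 2: mean of all versions with minStamp <= timestamp < maxStamp.
    double average(long minStamp, long maxStamp) {
        double sum = 0.0;
        int count = 0;
        for (double watts : versions.subMap(minStamp, true, maxStamp, false).values()) {
            sum += watts;
            count++;
        }
        if (count == 0) {
            throw new IllegalArgumentException("no samples in range");
        }
        return sum / count;
    }
}
```

The averaging itself is trivial; the cost in the real system is fetching and transferring every version in the range, so the number of versions between Date 1 and Date 2 is what drives the latency of query 2.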

The gateways table could just hold the list of its deviceids and then  
I have to manually fetch its 100 device entries from the devices  
table, but that proves to be quite slow.  So at the cost of disk space I
tried a schema such that it has a cf "gateway_samples" where each row  
is a gateway id (so exactly 50 rows), and it has a column for each of  
its 100 devices (so each row has 100 columns, but the cf has 5000  
columns).  Each sample is written to those cells in the same way as  
the devices table.  Then I should be able to answer query 3 with a  
"get latest versions from the whole row" and do my own sums, and  
similarly query 4.  In practice though, this works as expected (50ms)  
with very little data in the gateways table (50k total keyvalues), but  
once I've run the devices for a bit (~1.5M total keyvalues) a single  
row fetch takes 600ms.
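
The "do my own sums" step for queries 3 and 4 can be sketched the same way. Again this is plain Java rather than HBase client code: GatewayRow is a hypothetical stand-in for one row of the gateways table, with one column per device and a TreeMap of versions per column. For query 4 the sketch assumes "average power at the gateway" means the sum of each device's mean over the range; that interpretation is an assumption, not something stated above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for one gateway row:
// device id (column) -> (timestamp ms -> watts). NOT the HBase client API.
class GatewayRow {
    private final Map<String, NavigableMap<Long, Double>> columns = new HashMap<>();

    // Store a sample in the column belonging to the given device.
    void put(String deviceId, long timestampMs, double watts) {
        columns.computeIfAbsent(deviceId, k -> new TreeMap<>()).put(timestampMs, watts);
    }

    // Query 3: sum of the newest version of every device column.
    double currentTotal() {
        double total = 0.0;
        for (NavigableMap<Long, Double> versions : columns.values()) {
            total += versions.lastEntry().getValue();
        }
        return total;
    }

    // Query 4 (assumed meaning): sum over devices of each device's mean
    // for minStamp <= timestamp < maxStamp. Devices with no samples in
    // the range contribute nothing.
    double averageTotal(long minStamp, long maxStamp) {
        double total = 0.0;
        for (NavigableMap<Long, Double> versions : columns.values()) {
            double sum = 0.0;
            int count = 0;
            for (double watts : versions.subMap(minStamp, true, maxStamp, false).values()) {
                sum += watts;
                count++;
            }
            if (count > 0) {
                total += sum / count;
            }
        }
        return total;
    }
}
```

Query 3 only ever needs one value per column, so in principle its cost should depend on the 100 columns in the row and not on how many versions each column holds.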

Granted these are performance numbers from a dev machine with hbase
running in standalone mode, so they have no bearing on reality.  But it
feels like I'm doing something wrong when the devices table responds
very quickly and the gateways table doesn't.  I've tried moving hbase to an
old linux machine with the client still running from my dev machine  
and got basically the same results with a bit extra time for the  

Any and all advice is appreciated.

