hbase-user mailing list archives

From Jonathan Gray <jg...@facebook.com>
Subject RE: performance using versions as dimension
Date Wed, 19 May 2010 21:04:00 GMT
If I understand correctly, the gateway query gets so slow because you end up skipping
past (but still reading from HDFS) a lot of data that you then throw away.

What you have is 100 columns in each row.  Each column has lots of versions.  You want values
from all 100 columns within a specified TimeRange... correct?

That column family is stored like this:

<col1><ts10><value>
<col1><ts9><value>
... (ts 8 through 2 are here)
<col1><ts1><value>
<col2><ts10><value>
<col2><ts9><value>
...
...
<col100><ts1><value>

So if you want the values for all columns at ts=10, HBase will have to read through all the
ts=9 through ts=1 values as well.
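
To make the cost concrete, here is a toy model of a single row, written with plain Java
collections. It is not HBase code: the class, the method names and the "touch every key"
scan are assumptions made for illustration, and a real region server reads in blocks
rather than key by key. The time range is treated as min-inclusive, max-exclusive, the
same convention Get.setTimeRange uses.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Toy model of one row in a column family. Plain Java collections, not HBase code.
final class RowScanSketch {

    // Hypothetical stand-in for a stored key: column index plus version timestamp.
    static final class Key {
        final int column;
        final long ts;

        Key(int column, long ts) {
            this.column = column;
            this.ts = ts;
        }
    }

    // Same order as the listing above: by column, then newest timestamp first.
    static final Comparator<Key> STORED_ORDER = (a, b) -> {
        int byColumn = Integer.compare(a.column, b.column);
        return byColumn != 0 ? byColumn : Long.compare(b.ts, a.ts);
    };

    // How many entries the scan walked over versus how many it kept.
    static final class ScanStats {
        final int touched;
        final int returned;

        ScanStats(int touched, int returned) {
            this.touched = touched;
            this.returned = returned;
        }
    }

    // Build one row with timestamps 1..versionsPerColumn in every column.
    static TreeMap<Key, Double> buildRow(int columns, int versionsPerColumn) {
        TreeMap<Key, Double> row = new TreeMap<>(STORED_ORDER);
        for (int c = 1; c <= columns; c++) {
            for (long ts = 1; ts <= versionsPerColumn; ts++) {
                row.put(new Key(c, ts), 1.0); // the stored value is irrelevant here
            }
        }
        return row;
    }

    // Walk the whole row in stored order, keeping entries with minTs <= ts < maxTs.
    static ScanStats scan(TreeMap<Key, Double> row, long minTs, long maxTs) {
        int touched = 0;
        int returned = 0;
        for (Key key : row.keySet()) {
            touched++;
            if (key.ts >= minTs && key.ts < maxTs) {
                returned++;
            }
        }
        return new ScanStats(touched, returned);
    }

    public static void main(String[] args) {
        // The layout from the listing: 100 columns, 10 versions each, query ts=10.
        ScanStats small = scan(buildRow(100, 10), 10, 11);
        System.out.println("touched=" + small.touched + " returned=" + small.returned);

        // Same query shape with 300 versions per column.
        ScanStats big = scan(buildRow(100, 300), 300, 301);
        System.out.println("touched=" + big.touched + " returned=" + big.returned);
    }
}
```

With 10 versions per column the scan touches 1000 entries to return 100. With 300 versions
per column (about what ~1.5M keyvalues works out to if they are spread evenly over 50 rows
of 100 columns) it touches 30000 entries to return the same 100, so the wasted work grows
with the number of versions even though the result size does not.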

There are some optimizations planned for 0.21 (hopefully) that will do better at skipping
unnecessary blocks, but even with that you will likely end up reading a lot of unnecessary
data.

I need to think about the best way to implement this. Perhaps, with a better understanding
of the storage layout, you can now too :)

JG

> -----Original Message-----
> From: Oliver Meyn [mailto:oliver.meyn@zerofootprint.net]
> Sent: Wednesday, May 19, 2010 1:53 PM
> To: user@hbase.apache.org
> Subject: performance using versions as dimension
> 
> Hi All,
> 
> I'm new to hbase and columnar storage schemas, so any comments you
> have on the schema or the actual problem at hand are very much
> welcome.  I'm using 0.20.4, initially testing as standalone on my
> development laptop (OS X), all settings default except for data
> directory, and accessing hbase through the Java api.
> 
> In my initial testing I have 50 Gateways, each of which is
> responsible for 100 unique Devices, each of which reports its power
> usage every second.  So that's 5000 total, unique Devices.  Here are
> the queries I'd like to answer:
> 
> 1) What is the current power consumption of Device X?
> 2) What is the average power consumption of Device X between Date 1
> and Date 2?
> 3) What is the current power consumption at Gateway Y?
> 4) What is the average power consumption at Gateway Y between Date 1
> and Date 2?
> 
> I'm imagining this as two tables - "devices" and "gateways".  The
> devices table has a column family called "device_samples" which only
> has one column "power" and 5000 rows (one for each device).  Every new
> sample gets written to the power column of its device at the timestamp
> from the original sample sent by the Device.  Now I can answer query 1
> with a simple get, and I can answer query 2 using the api setTimeRange
> call on another simple get (and do my own math to average the
> results).  This works great so far - with 50k versions in each cell
> query 1 is less than 50ms, and query 2 is only marginally more (on my
> dev machine, remember).
> 
> The gateways table could just hold the list of its deviceids and then
> I have to manually fetch its 100 device entries from the devices
> table, but that proves to be quite slow.  So at the cost of disk space I
> tried a schema such that it has a cf "gateway_samples" where each row
> is a gateway id (so exactly 50 rows), and it has a column for each of
> its 100 devices (so each row has 100 columns, but the cf has 5000
> columns).  Each sample is written to those cells in the same way as
> the devices table.  Then I should be able to answer query 3 with a
> "get latest versions from the whole row" and do my own sums, and
> similarly query 4.  In practice though, this works as expected (50ms)
> with very little data in the gateways table (50k total keyvalues), but
> once I've run the devices for a bit (~1.5M total keyvalues) a single
> row fetch takes 600ms.
> 
> Granted these are performance numbers from a dev machine with hbase
> running in standalone mode, so they have no bearing on reality.  But it
> feels like I'm doing something wrong when the devices table responds
> very quickly and the gateways table doesn't.  I've tried moving hbase to an
> old linux machine with the client still running from my dev machine
> and got basically the same results with a bit extra time for the
> network.
> 
> Any and all advice is appreciated.
> 
> Thanks,
> Oliver
> 
> 

