If I understand correctly, the gateway query gets so slow because you end up skipping past
a lot of data that you throw away, but that data still has to be read from HDFS.
What you have is 100 columns in each row. Each column has lots of versions. You want values
from all 100 columns within a specified TimeRange... correct?
That column family is stored like this:
<col1><ts10><value>
<col1><ts9><value>
... (ts 8 through 2 are here)
<col1><ts1><value>
<col2><ts10><value>
<col2><ts9><value>
...
...
<col100><ts1><value>
So if you want the values for all columns at ts=10, HBase will have to read through all the
ts=9 through ts=1 values as well.
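To make the cost concrete, here is a rough sketch in plain Java. It is a toy model of the layout above and does not use HBase's scanner or API. The class and method names (VersionScanSketch, Cell, buildRow, scan) are made up for illustration, and it assumes a simple linear pass over the row with a half-open [minTs, maxTs) range like the one setTimeRange takes.

```java
import java.util.ArrayList;
import java.util.List;

public class VersionScanSketch {
    // Hypothetical stand-in for a stored cell: column qualifier plus timestamp.
    static final class Cell {
        final String column;
        final long ts;
        Cell(String column, long ts) { this.column = column; this.ts = ts; }
    }

    // Build one row with `columns` columns and versions ts=1..versions,
    // sorted the way the layout above shows: by column, then newest first.
    static List<Cell> buildRow(int columns, int versions) {
        List<Cell> row = new ArrayList<>();
        for (int c = 1; c <= columns; c++) {
            for (long ts = 1; ts <= versions; ts++) {
                row.add(new Cell(String.format("col%03d", c), ts));
            }
        }
        row.sort((a, b) -> {
            int byColumn = a.column.compareTo(b.column);
            return byColumn != 0 ? byColumn : Long.compare(b.ts, a.ts);
        });
        return row;
    }

    // Assumed model: walk every cell in stored order and keep those whose
    // timestamp falls in [minTs, maxTs). Returns {cells touched, cells matched}.
    static int[] scan(List<Cell> row, long minTs, long maxTs) {
        int touched = 0;
        int matched = 0;
        for (Cell cell : row) {
            touched++;
            if (cell.ts >= minTs && cell.ts < maxTs) {
                matched++;
            }
        }
        return new int[] {touched, matched};
    }

    public static void main(String[] args) {
        List<Cell> row = buildRow(100, 10);
        int[] result = scan(row, 10, 11); // only ts=10 is wanted
        System.out.println("touched=" + result[0] + " matched=" + result[1]);
    }
}
```

With 100 columns and 10 versions each, this touches 1000 cells to return 100. The wasted work grows with the number of versions per column, which is why a row with tens of thousands of versions per cell gets slow.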
There are some optimizations planned for 0.21 (hopefully) that will do better at skipping
unnecessary blocks, but even with that you will likely end up reading a lot of unnecessary
data.
I need to think about what the best way to implement this would be. Perhaps, with a better
understanding of how the data is stored, you can too :)
JG
> -----Original Message-----
> From: Oliver Meyn [mailto:oliver.meyn@zerofootprint.net]
> Sent: Wednesday, May 19, 2010 1:53 PM
> To: user@hbase.apache.org
> Subject: performance using versions as dimension
>
> Hi All,
>
> I'm new to hbase and columnar storage schemas, so any comments you
> have on the schema or the actual problem at hand are very much
> welcome. I'm using 0.20.4, initially testing as standalone on my
> development laptop (OS X), all settings default except for data
> directory, and accessing hbase through the Java api.
>
> In my initial testing I have 50 Gateways, each of which are
> responsible for 100 unique Devices, each of which report their power
> usage every second. So that's 5000 total, unique Devices. Here are
> the queries I'd like to answer:
>
> 1) What is the current power consumption of Device X?
> 2) What is the average power consumption of Device X btw Date 1 and
> Date 2?
> 3) What is the current power consumption at Gateway Y?
> 4) What is the average power consumption at Gateway Y btw Date 1 and
> Date 2?
>
> I'm imagining this as two tables - "devices" and "gateways". The
> devices table has a column family called "device_samples" which only
> has one column "power" and 5000 rows (one for each device). Every new
> sample gets written to the power column of its device at the timestamp
> from the original sample sent by the Device. Now I can answer query 1
> with a simple get, and I can answer query 2 using the api setTimeRange
> call on another simple get (and do my own math to average the
> results). This works great so far - with 50k versions in each cell
> query 1 is less than 50ms, and query 2 is only marginally more (on my
> dev machine, remember).
>
> The gateways table could just hold the list of its deviceids and then
> I have to manually fetch its 100 device entries from the devices
> table, but that proves to be quite slow. So at the cost of disks I
> tried a schema such that it has a cf "gateway_samples" where each row
> is a gateway id (so exactly 50 rows), and it has a column for each of
> its 100 devices (so each row has 100 columns, but the cf has 5000
> columns). Each sample is written to those cells in the same way as
> the devices table. Then I should be able to answer query 3 with a
> "get latest versions from the whole row" and do my own sums, and
> similarly query 4. In practice though, this works as expected (50ms)
> with very little data in the gateways table (50k total keyvalues), but
> once I've run the devices for a bit (~1.5M total keyvalues) a single
> row fetch takes 600ms.
>
> Granted these are performance numbers from a dev machine with hbase
> running in standalone mode, so have no bearing on reality. But it
> feels like I'm doing something wrong when the devices table responds
> very quickly and the gateways doesn't. I've tried moving hbase to an
> old linux machine with the client still running from my dev machine
> and got basically the same results with a bit extra time for the
> network.
>
> Any and all advice is appreciated.
>
> Thanks,
> Oliver
>
>