hbase-user mailing list archives

From "Hegner, Travis" <THeg...@trilliumit.com>
Subject RE: performance using versions as dimension
Date Fri, 21 May 2010 12:58:45 GMT
Since every datapoint is relatively small, you could store the same data in multiple tables.

Maybe you could dedicate an entire table to the current value, with a <gw_id><dev_id>
row key. Since it seems that retrieval speed is of utmost importance to you, you could also
aggregate and store the entire gateway's consumption on insert rather than on retrieval. A row
key like <gw_id>00000000 may work for that aggregate, or whatever works for you.

Then you could have a table with nothing but timestamps as row keys, and store the <gw_id><dev_id>
as a column with the value, if you received one during that timestamp. Stored this way, your
largest possible row has only 5000 columns in it, instead of growing to an arbitrary size.
With that you could just get your time range, search through each row for your <dev_id>
or <gw_id>, and aggregate accordingly.
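A minimal sketch of that timestamp-keyed layout (a TreeMap stands in for HBase's lexicographically sorted row keys; the zero-padding width and the gw.dev qualifier naming are assumptions, not from the thread). The point is that fixed-width, zero-padded epoch seconds sort in time order, so a time-range query becomes a plain row-key sub-range:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class TimestampRowSketch {
    // Fixed-width, zero-padded epoch seconds: lexicographic order == chronological order.
    static String rowKey(long epochSeconds) {
        return String.format("%010d", epochSeconds);
    }

    public static void main(String[] args) {
        // Outer map: row key (second) -> columns; inner map: <gw_id>.<dev_id> -> watts.
        NavigableMap<String, NavigableMap<String, Double>> table = new TreeMap<>();
        long t0 = 1274443200L; // arbitrary example epoch seconds
        table.computeIfAbsent(rowKey(t0), k -> new TreeMap<>()).put("gw01.dev001", 120.5);
        table.computeIfAbsent(rowKey(t0), k -> new TreeMap<>()).put("gw01.dev002", 60.0);
        table.computeIfAbsent(rowKey(t0 + 5), k -> new TreeMap<>()).put("gw01.dev001", 125.0);

        // "Scan" seconds [t0, t0+3): only the first row falls in range.
        NavigableMap<String, NavigableMap<String, Double>> range =
                table.subMap(rowKey(t0), true, rowKey(t0 + 3), false);
        double sum = 0;
        int n = 0;
        for (NavigableMap<String, Double> row : range.values()) {
            for (double v : row.values()) { sum += v; n++; }
        }
        System.out.println(n + " samples, avg " + (sum / n));
    }
}
```

In real HBase the same range would be expressed as a Scan with start/stop rows built from the padded timestamps.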

I would test those methods against the second one you proposed below. I don't know which would
be faster.

As long as disk space is no object (with Hadoop/HBase, it shouldn't be), store the same
data as many times as you need in order to decrease your retrieval times. It's kind of backwards
from an RDBMS, but that's just how it works. I think of it as manual indexing.

Travis Hegner

-----Original Message-----
From: Oliver Meyn [mailto:oliver.meyn@zerofootprint.net]
Sent: Thursday, May 20, 2010 3:06 PM
To: user@hbase.apache.org
Subject: Re: performance using versions as dimension

Hi Travis,

Thanks for the suggestions.  As it happens I simplified the problem
for my original question and now the details start to matter.  The
gateways actually have some smarts, and will only send a device sample
if the device's power consumption has changed "significantly", where
that significance is configurable per gateway.  That means the latest
power sample for a device could have been sent hours (or even days)
ago, so scanning for the latest result is trickier than "go back at
least one second and you're guaranteed a result".  That's why I liked
the versions style where the sample on top of the "stack" is the
latest, regardless of when that was.
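One common way around that wrinkle (not from the thread itself, so treat the key layout as an assumption) is the reversed-timestamp trick: key each sample as <dev_id><Long.MAX_VALUE - ts>, so the newest sample for a device sorts first under its prefix and a scan starting at the prefix returns the latest reading in one step, however old it is. A TreeMap again stands in for HBase's sorted rows:

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class LatestSampleSketch {
    // Fixed-width reversed timestamp: lexicographic order == newest-first.
    static String key(String devId, long ts) {
        return devId + "." + String.format("%019d", Long.MAX_VALUE - ts);
    }

    public static void main(String[] args) {
        TreeMap<String, Double> table = new TreeMap<>();
        // dev001's latest sample is old (significant-change reporting); dev002 is fresh.
        table.put(key("dev001", 1274000000L), 80.0);
        table.put(key("dev001", 1273900000L), 75.0);
        table.put(key("dev002", 1274443200L), 42.0);

        // Latest sample for dev001: first key at or after the device prefix.
        SortedMap<String, Double> tail = table.tailMap("dev001.");
        System.out.println("latest dev001 = " + tail.get(tail.firstKey()));
    }
}
```

In HBase terms this is a Scan with the device id as start row and a limit of one result, which avoids relying on versions entirely.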

If you have any thoughts on that wrinkle, I'm happy to hear them.  In
the meantime I'm trying out your schema anyway, and another variant in
which each sample gets a new column (named after its timestamp)
against a row key of gw_id.dev_id.


On 20-May-10, at 9:34 AM, Hegner, Travis wrote:

> Oliver,
> It may be an assumption I've made, but it seems to me that HBase is
> most efficient at handling a large number of rows, rather than a large
> number of timestamps, or even columns for that matter (I think it's on
> the HBase main page I read "Billions of rows x Millions of columns x
> thousands of versions", which leads to my assumption).
> Perhaps you should consider testing with each datapoint stored as an
> individual row, with a row id like: <unix_time><gw_id><dev_id> or
> <unix_time>.<gw_id>.<dev_id>
> With that method, you could answer query 1 by finding the last entry
> for a given "<gw_id><dev_id>", query 3 by getting all the latest
> <dev_id>'s for any given gw, and queries 2 and 4 by simply grabbing
> a range of rows and parsing through the results since they are
> already ordered by the timestamp that they arrived.
> This way, you are really only "getting" what you actually need, and
> to scan for the latest entry of any given device, you're only having
> to scan through 5000 very small rows at most.
> Just a thought, HTH,
> Travis Hegner
> http://www.travishegner.com/
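A sketch of the <unix_time><gw_id><dev_id> composite key suggested above. Encoding the timestamp as a big-endian 8-byte long (which is what HBase's Bytes.toBytes(long) produces) keeps rows ordered by arrival time under HBase's byte-wise key comparison, so queries 2 and 4 become plain row-range scans; the id widths here are illustrative assumptions:

```java
import java.nio.ByteBuffer;

public class CompositeKeySketch {
    static byte[] rowKey(long unixTime, String gwId, String devId) {
        ByteBuffer buf = ByteBuffer.allocate(8 + gwId.length() + devId.length());
        buf.putLong(unixTime);            // ByteBuffer is big-endian by default
        buf.put(gwId.getBytes());
        buf.put(devId.getBytes());
        return buf.array();
    }

    // Unsigned byte-wise comparison, the same order HBase uses for row keys.
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] earlier = rowKey(1274443200L, "gw01", "dev001");
        byte[] later   = rowKey(1274443201L, "gw01", "dev001");
        System.out.println("earlier sorts first: " + (compare(earlier, later) < 0));
    }
}
```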
> -----Original Message-----
> From: Oliver Meyn [mailto:oliver.meyn@zerofootprint.net]
> Sent: Thursday, May 20, 2010 8:59 AM
> To: user@hbase.apache.org
> Subject: Re: performance using versions as dimension
> Thanks for the quick reply Jonathan:
> On 19-May-10, at 5:04 PM, Jonathan Gray wrote:
>> What you have is 100 columns in each row.  Each column has lots of
>> versions.  You want values from all 100 columns within a specified
>> TimeRange... correct?
> This is the 2nd of my two gateway queries - this one being slow I
> understand and your explanation makes sense.  The first query is a
> simple "get me the latest version from every column for this row" and
> that is what, to me, is perplexingly slow.  To be clear, there's a
> good chance that each of those columns will have a different
> timestamp, but "the latest reading" is what I'm interested in.
>> I need to think on what the best way to implement this would be,
>> perhaps with a better understanding now you can too :)
> I know it's something of a religious topic, but as of 0.20.4, is using
> versions as a data dimension legitimate?  Because I could easily
> approach millions of versions per column, am I in danger of running
> into the elsewhere-mentioned row-split problem (each of my cell values
> is a double)?  I ask because if that's going to be a problem then I
> need to rethink my schema anyway, and then we don't need to waste
> cycles on the current problem.
> Thanks again,
> Oliver
>>> -----Original Message-----
>>> From: Oliver Meyn [mailto:oliver.meyn@zerofootprint.net]
>>> Sent: Wednesday, May 19, 2010 1:53 PM
>>> To: user@hbase.apache.org
>>> Subject: performance using versions as dimension
>>> Hi All,
>>> I'm new to hbase and columnar storage schemas, so any comments you
>>> have on the schema or the actual problem at hand are very much
>>> welcome.  I'm using 0.20.4, initially testing as standalone on my
>>> development laptop (OS X), all settings default except for data
>>> directory, and accessing hbase through the Java api.
>>> In my initial testing I have 50 Gateways, each of which are
>>> responsible for 100 unique Devices, each of which report their power
>>> usage every second.  So that's 5000 total, unique Devices.  Here are
>>> the queries I'd like to answer:
>>> 1) What is the current power consumption of Device X?
>>> 2) What is the average power consumption of Device X btw Date 1 and
>>> Date 2?
>>> 3) What is the current power consumption at Gateway Y?
>>> 4) What is the average power consumption at Gateway Y btw Date 1 and
>>> Date 2?
>>> I'm imagining this as two tables - "devices" and "gateways".  The
>>> devices table has a column family called "device_samples" which only
>>> has one column "power" and 5000 rows (one for each device).  Every
>>> new
>>> sample gets written to the power column of its device at the
>>> timestamp
>>> from the original sample sent by the Device.  Now I can answer
>>> query 1
>>> with a simple get, and I can answer query 2 using the api
>>> setTimeRange
>>> call on another simple get (and do my own math to average the
>>> results).  This works great so far - with 50k versions in each cell
>>> query 1 is less than 50ms, and query 2 is only marginally more (on
>>> my
>>> dev machine, remember).
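The access pattern described above can be sketched without a cluster: a cell's versions behave like a sorted map from timestamp to value (this mirrors what Result.getMap() returns in the Java API). Query 1 is "newest version"; query 2 is a timestamp sub-range plus client-side averaging, as with Get.setTimeRange. The timestamps and values here are made up for illustration:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class VersionsSketch {
    public static void main(String[] args) {
        // Versions of one device's "power" cell: timestamp -> watts.
        NavigableMap<Long, Double> versions = new TreeMap<>();
        versions.put(1000L, 100.0);
        versions.put(2000L, 110.0);
        versions.put(3000L, 130.0);

        // Query 1: latest reading = highest-timestamp version.
        System.out.println("latest = " + versions.lastEntry().getValue());

        // Query 2: average over timestamps [1000, 3000) -- the client does the math.
        double avg = versions.subMap(1000L, true, 3000L, false).values()
                .stream().mapToDouble(Double::doubleValue).average().orElse(0);
        System.out.println("avg = " + avg);
    }
}
```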
>>> The gateways table could just hold the list of its deviceids and
>>> then
>>> I have to manually fetch its 100 device entries from the devices
>>> table, but that proves to be quite slow.  So at the cost of disks I
>>> tried a schema such that it has a cf "gateway_samples" where each
>>> row
>>> is a gateway id (so exactly 50 rows), and it has a column for each
>>> of
>>> its 100 devices (so each row has 100 columns, but the cf has 5000
>>> columns).  Each sample is written to those cells in the same way as
>>> the devices table.  Then I should be able to answer query 3 with a
>>> "get latest versions from the whole row" and do my own sums, and
>>> similarly query 4.  In practice though, this works as expected
>>> (50ms)
>>> with very little data in the gateways table (50k total keyvalues),
>>> but
>>> once I've run the devices for a bit (~1.5M total keyvalues) a single
>>> row fetch takes 600ms.
>>> Granted these are performance numbers from a dev machine with hbase
>>> running in standalone mode, so have no bearing on reality.  But it
>>> feels like I'm doing something wrong when the devices table responds
>>> very quickly and the gateways doesn't.  I've tried moving hbase to
>>> an
>>> old linux machine with the client still running from my dev machine
>>> and got basically the same results with a bit extra time for the
>>> network.
>>> Any and all advice is appreciated.
>>> Thanks,
>>> Oliver

