hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: efficient storage of timeseries
Date Mon, 15 Feb 2010 08:03:36 GMT
In terms of efficiency, there are two answers to that:
- compression helps a lot
- disk space on DFS is substantially cheaper than on typical RDBMS storage

As for the schema design, I'd generally suggest going with taller
tables and scans rather than with wide rows - with the way RPCs
currently work, we must hold an entire row in memory when streaming it
back to the client.  If your data eventually grows extremely large,
you could run into an OutOfMemoryError (OOME).  Thus taller tables
with scans are the way to go.
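
To make the tall-table idea concrete, here's a minimal sketch in plain
Python (no HBase needed); the row_key helper and the 10-digit
zero-padding are illustrative assumptions, not HBase API:

```python
# Sketch of "tall table" row keys: encode host plus a fixed-width
# timestamp into one key, so a plain lexicographic scan over keys
# returns a host's samples in time order.

def row_key(host, ts):
    # Zero-pad the epoch timestamp to 10 digits so byte-wise
    # (lexicographic) ordering matches numeric ordering.
    return "%s-%010d" % (host, ts)

keys = sorted(row_key("butters", t) for t in (1266208871, 1266208853, 1266208324))
# keys are now in ascending time order:
# ['butters-1266208324', 'butters-1266208853', 'butters-1266208871']
```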


The current shell doesn't have a super elegant way to view versions
right now; it's mostly row-centric.  Setting the number of row
versions a table keeps is part of the create options in the shell; run
'help' in there.  Here is a brief example:
create 'table', {NAME => 'famName', VERSIONS => 10000}

you can also do an alter, after disabling the table:
disable 'table'
alter 'table', {NAME => 'famName', VERSIONS => 10000}
enable 'table'

One advantage of not using timestamps to carry important data is that
you keep versioning available, which can sometimes help (recovering
from bugs, etc.).
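
As a rough conceptual sketch (plain Python, not the actual HBase
implementation), per-cell retention under VERSIONS=N behaves roughly
like this; MAX_VERSIONS and put are made-up names for illustration:

```python
# Conceptual model of per-cell version retention: each (row, column)
# keeps at most MAX_VERSIONS timestamped values, newest first.

MAX_VERSIONS = 3  # mirrors HBase's default of 3 versions

def put(store, row, col, ts, value):
    cell = store.setdefault((row, col), [])
    cell.append((ts, value))
    cell.sort(reverse=True)    # newest timestamp first
    del cell[MAX_VERSIONS:]    # older versions fall off

store = {}
for ts, v in [(1, 0.1), (2, 0.2), (3, 0.3), (4, 0.4)]:
    put(store, "butters", "attribute:load_1", ts, v)
# store[("butters", "attribute:load_1")] == [(4, 0.4), (3, 0.3), (2, 0.2)]
```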

Some other people have done time-series databases (tsdb) on HBase; I'll let them speak up.




On Sun, Feb 14, 2010 at 9:33 PM, Eric Maland <eric.maland@gmail.com> wrote:
> Hi,
>
> I'm pretty new to HBase, so bear with me - I am porting a system for
> storing timeseries data to both HBase and Cassandra.  This is pretty
> straightforward, but it opens a ton of questions about how the schema
> should be designed and how the data should be stored.
>
> I'm hoping some of you might have advice on what route I should settle
> on.  I'd like to go over some ideas and the questions they raise.
>
> Let's say we're fetching data about hosts like load average, disk free, etc.
>  Ideas I've come up with thus far are:
>
> row key: host
> cf/qualifier: attribute:name
> value: whatever
> row timestamp: time of sampling
>
> Example:
>
> ROW                    COLUMN+CELL
>  butters               column=attribute:load_1, timestamp=1266208871, value=0.3
>  butters               column=attribute:load_1, timestamp=1266208853, value=0.3
>  butters               column=attribute:load_1, timestamp=1266208324, value=0.3
>  butters               column=attribute:load_5, timestamp=1266208871, value=1.3
>  butters               column=attribute:load_5, timestamp=1266208853, value=1.3
>  butters               column=attribute:load_5, timestamp=1266208325, value=1.3
>  kenny                 column=attribute:load_1, timestamp=1266208871, value=0.6
>  kenny                 column=attribute:load_1, timestamp=1266208853, value=0.6
>  kenny                 column=attribute:load_1, timestamp=1266208324, value=0.6
>
> etc.
>
> You can easily fetch multiple versions using setTimeRange(), and you
> can potentially add new data sampled for a host just by adding another
> qualifier for that row (eg load_5 or disk_free or whatever).
>
> The downside, or the part I don't really understand here, is figuring
> out how many versions you have (count 'table' doesn't report version
> counts) and configuring how many versions are stored - it appears to
> store only 3 by default.  I don't know how to change that, and I don't
> know whether versions of a row are physically stored next to the
> previous versions or not.  Is this a suboptimal design if we expect
> the data to grow to several terabytes within the year (or, say, 3
> months)?
>
> Another option:
>
> row key: host-timestamp
> cf/qualifier: attribute:name
> value: whatever
> row timestamp: time of row insert
>
> ROW                    COLUMN+CELL
>  butters-1266208871    column=attribute:load_1, value=0.3
>  butters-1266208853    column=attribute:load_1, value=0.3
>  butters-1266208324    column=attribute:load_1, value=0.3
>  butters-1266208871    column=attribute:load_5, value=1.3
>  butters-1266208853    column=attribute:load_5, value=1.3
>  butters-1266208325    column=attribute:load_5, value=1.3
>  kenny-1266208871      column=attribute:load_1, value=0.6
>  kenny-1266208853      column=attribute:load_1, value=0.6
>  kenny-1266208324      column=attribute:load_1, value=0.6
>
> ...
>
> This duplicates the timestamp in the key, which seems wasteful, but it
> gives proper counts and lets you use key-range scans to fetch rows in
> a range.  Is this going to be slower but spread the data out further?
> Is this a good idea?  Is filtering on a column family's qualifier
> efficient (say I want load_1 for the last 2 years for a host, but
> there are 1000 or so qualifiers per row - is that going to explode
> anywhere)?
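
To make the key-range idea above concrete: with host-timestamp keys, a
time-range query for one host is just a start/stop row pair. A
plain-Python sketch (the fixed 10-digit timestamp width is an
assumption so byte order matches time order):

```python
# With host-timestamp row keys, fetching one host's samples in a time
# range is a lexicographic key-range scan between a start and stop row.

def scan_range(host, t_start, t_end):
    # HBase scan stop rows are exclusive, so bump the end by one second
    return ("%s-%010d" % (host, t_start),
            "%s-%010d" % (host, t_end + 1))

start, stop = scan_range("butters", 1266208324, 1266208871)
rows = ["butters-1266208324", "butters-1266208853", "butters-1266208871",
        "kenny-1266208324"]
hits = [r for r in rows if start <= r < stop]
# hits contains only the three butters rows
```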
>
> I've also thought about:
>
> row key: host-attribute
> cf/qualifier: value:
> value: whatever
> row timestamp: time of row insert
>
> or:
>
> row key: host-attribute-timestamp
> cf/qualifier: value:
> value: whatever
> row timestamp: time of row insert
>
> But these also seem like suboptimal methods.  If anyone has experience or
> thoughts on this I'd love to hear them!
>
> Also, I incorporated the Thrift server changes from
> http://issues.apache.org/jira/browse/HBASE-1744 into the current Thrift
> server to get at the time range and version functionality (I added a few
> features to the API and fixed it so it'll work with Python).  If anyone is
> interested in having that patch for the current thrift server let me know,
> it's simple and I'll happily share it.  I'll likely stuff it up on github
> soon.
>
> Thanks!
> Eric
>
