hbase-user mailing list archives

From Ulrich Staudinger <ustaudin...@activequant.com>
Subject Re: How would you model this in Hbase?
Date Thu, 07 Feb 2013 07:14:04 GMT
Why don't you simply identify the six different types of information per
number:

- figure name (unemployment)
- month (reporting)
- release date
- figure
- revision date
- revised figure

The key would be:
<figure name>_<month>

Et voilà.
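
To make the key concrete, here is a minimal Python sketch of that layout. It is plain in-memory code, not HBase client code. The names row_key and make_row, the column names, and the YYYY-MM month format are my own illustrative assumptions (the zero-padded month is chosen so that keys sort chronologically as strings). The sample values are the Oct 2011 figures from the original question.

```python
from datetime import date
from typing import Optional


def row_key(figure_name: str, reporting_month: date) -> str:
    """Build '<figure name>_<month>'; zero-padded YYYY-MM sorts chronologically."""
    return f"{figure_name}_{reporting_month:%Y-%m}"


def make_row(release_date: date, figure: float,
             revision_date: Optional[date] = None,
             revised_figure: Optional[float] = None) -> dict:
    """One plain column per piece of information.

    The two revision columns stay None until a revision is published.
    """
    return {
        "release_date": release_date,
        "figure": figure,
        "revision_date": revision_date,
        "revised_figure": revised_figure,
    }


# Oct 2011: first released on 2011-11-01 as 2, revised on 2011-12-01 to 2.5.
key = row_key("unemployment", date(2011, 10, 1))
row = make_row(date(2011, 11, 1), 2.0, date(2011, 12, 1), 2.5)
print(key, row)
```

Figure name and month live in the key, the other four items are ordinary columns, so nothing depends on cell timestamps.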

I strongly advise against "overloading" the timestamping/versioning feature
of HBase.
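
To illustrate the tombstone concern Michael raises in the quoted message below, here is a toy Python model of a single versioned cell. It is a deliberately simplified sketch, not the HBase API: the class ToyCell and its methods are hypothetical, and it assumes a column-level delete marker hides every version with a timestamp at or below the marker until the marker itself is cleaned up.

```python
class ToyCell:
    """Toy model of one versioned cell (NOT the HBase API)."""

    def __init__(self):
        self.versions = {}        # timestamp -> value
        self.tombstone_ts = None  # column-level delete marker, if any

    def put(self, ts, value):
        self.versions[ts] = value

    def delete_column(self, ts):
        # Assumption: the marker masks all versions with timestamp <= ts.
        if self.tombstone_ts is None or ts > self.tombstone_ts:
            self.tombstone_ts = ts

    def visible(self):
        # Only versions newer than the marker remain readable.
        return {ts: v for ts, v in self.versions.items()
                if self.tombstone_ts is None or ts > self.tombstone_ts}


cell = ToyCell()
cell.put(20111101, 2.0)   # release date used as the timestamp
cell.put(20111201, 2.5)   # revision stored as a newer version
cell.delete_column(20130207)
after_delete = cell.visible()

cell.put(20111101, 2.0)   # back-dated re-insert is still masked
after_reinsert = cell.visible()
print(after_delete, after_reinsert)
```

In this toy model a single delete wipes out the whole release history of the value, and re-loading back-dated data does not bring it back. With explicit columns per row, a delete only affects what you explicitly target.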


You would still have to load the entire series and sort it by whatever you
like, but that's not a problem with HBase.
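
As a rough sketch of that "load and sort" step, the Python below derives the series asked for in the original question from rows keyed by <figure name>_<month>. The dict stands in for the result of a scan; the function names (known_to_market, ex_post, as_of) and column names are illustrative assumptions, and the data is the Aug to Nov 2011 sample from the question.

```python
from datetime import date

# In-memory stand-in for a scan over keys starting with "unemployment_".
rows = {
    "unemployment_2011-11": {"release_date": date(2011, 12, 1), "figure": 1.0,
                             "revision_date": None, "revised_figure": None},
    "unemployment_2011-10": {"release_date": date(2011, 11, 1), "figure": 2.0,
                             "revision_date": date(2011, 12, 1), "revised_figure": 2.5},
    "unemployment_2011-09": {"release_date": date(2011, 10, 1), "figure": 3.0,
                             "revision_date": date(2011, 11, 1), "revised_figure": 3.5},
    "unemployment_2011-08": {"release_date": date(2011, 9, 1), "figure": 4.0,
                             "revision_date": date(2011, 10, 1), "revised_figure": 4.5},
}


def known_to_market(rows):
    """First-release figures, newest month first."""
    return [(k, r["figure"]) for k, r in sorted(rows.items(), reverse=True)]


def ex_post(rows):
    """Latest available figure per month: the revision if there is one."""
    return [(k, r["figure"] if r["revised_figure"] is None else r["revised_figure"])
            for k, r in sorted(rows.items(), reverse=True)]


def as_of(rows, when):
    """What was published on or before `when`, newest month first."""
    out = []
    for k, r in sorted(rows.items(), reverse=True):
        if r["revision_date"] is not None and r["revision_date"] <= when:
            out.append((k, r["revised_figure"]))
        elif r["release_date"] <= when:
            out.append((k, r["figure"]))
    return out


print(known_to_market(rows))
print(ex_post(rows))
print(as_of(rows, date(2011, 11, 1)))
```

All three views come from the same plain columns, so the "time travel" question is answered by comparing dates in the client rather than by relying on cell versions.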

Thinking in ActiveQuant, you would store each of the columns above through
its IArchiveWriter. Then you can seamlessly view/chart it in the
ActiveQuant Master Server, making it available over CSV and SOAP to your
corporate intranet.


Cheers



On Wed, Feb 6, 2013 at 11:01 PM, James Taylor <jtaylor@salesforce.com> wrote:

> Another approach would be to use Phoenix
> (http://github.com/forcedotcom/phoenix). You can
> model your schema as you would in the relational world, but you get the
> horizontal scalability of HBase.
>
>     James
>
>
> On 02/06/2013 01:49 PM, Michael Segel wrote:
>
>> Overloading the time stamp aka the versions of the cell is really not a
>> good idea.
>>
>> Yeah, I know opinions are like A.... everyone has one. ;-)
>>
>> But you have to be aware that if someone decides to delete some data...
>> well one tombstone marker for the column, goodbye all of the versions of
>> the cell.
>> (Any ideas on a clean easy way to remove that tombstone?  ;-)
>>
>> You're better off using other methods of adding dimension to your cell.
>> One that works well is using Avro.
>>
>> When I teach a course on HBase, I cover cells in the schema
>> design section of the course. I talk about the ability to use the
>> versioning as a way to add dimension and then tell the students that this
>> really isn't a good idea ...
>>
>> -Just saying...
>>
>> On Feb 6, 2013, at 3:05 PM, Ian Varley <ivarley@salesforce.com> wrote:
>>
>>> Alex,
>>>
>>> This might be an interesting use of the time dimension in HBase. Every
>>> value in HBase is uniquely represented by a set of coordinates:
>>>
>>> - table
>>> - row key
>>> - column family
>>> - column qualifier
>>> - timestamp
>>>
>>> So, you can have two different values that have all the same
>>> coordinates, except their timestamp. So for your example, that could be:
>>>
>>> - table: econ
>>> - row key: "indicatorABC"
>>> - column family: cf1
>>> - column qualifier: "reporting_2011-10-01"
>>>
>>> first value:
>>> - timestamp: "2011-11-01 00:00:00.000"
>>> - value: 2
>>>
>>> second value:
>>> - timestamp: "2011-12-01 00:00:00.000"
>>> - value: 2.5
>>>
>>> I.e., if you load the data such that the timestamps on the values
>>> represent the release date, then you can model this in a natural way. By
>>> default, reads in HBase will only give you the latest value, but you can
>>> manually tell a scanner to give you "time travel" by only reporting values
>>> as of an older date; so you could say "tell me what the data would have
>>> said on 11/01".
>>>
>>> (Also, by default, HBase only keeps a limited number of historical
>>> versions (3), but you can tell it to keep all versions.)
>>>
>>> There are some downsides to using the time dimension explicitly like
>>> this:
>>> - If you back date things and also work with deletes, you could get some
>>> weird behavior depending on when compaction runs.
>>> - If you have lots of versions of things, the server still has to read
>>> over these when you scan, which makes things slower. (Probably doesn't
>>> apply if you only have a couple historical versions of any given value.)
>>>
>>> All the usual caveats apply: don't bother with HBase unless you've got
>>> some serious size in your data (e.g. TB) and need to support a heavy load
>>> of real-time updates and queries. Otherwise, go with something simpler to
>>> operate like a relational database, couchdb, etc.
>>>
>>> Ian
>>>
>>> On Feb 6, 2013, at 2:24 PM, Alex Grund wrote:
>>>
>>> Hi,
>>>
>>> I am new to NoSQL databases and I am wondering how to model a
>>> specific case with HBase.
>>>
>>> What I want to model are economic time series, such as the
>>> unemployment rate in a given country.
>>>
>>> The complicated thing is this: values of an economic time series can,
>>> but do not have to, be revised.
>>>
>>> An example:
>>>
>>> Imagine the time series is published monthly, on the first day of a
>>> month, with the value for the previous month, like so:
>>>
>>> Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
>>> Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
>>> Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
>>> Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4
>>>
>>> (where "release" is the date of release and "reporting" is the date of
>>> the month the "value" refers to. Read: "On Dec 1, 2011, the
>>> unemployment rate for Nov 2011 was reported to be 1".)
>>>
>>> Now imagine that on every release, the value for the month before the
>>> reporting month is also revised, like so:
>>>
>>> Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
>>> Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5
>>>
>>> Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
>>> Unemployment; release: 2011/11/01; reporting: 2011/09/01; value: 3.5
>>>
>>> Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
>>> Unemployment; release: 2011/10/01; reporting: 2011/08/01; value: 4.5
>>>
>>> Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4
>>> Unemployment; release: 2011/09/01; reporting: 2011/07/01; value: 5.5
>>>
>>> Read: On Oct 1, 2011, the unemployment rate was reported to be "3"
>>> for Sep, and the revised value for Aug was reported to be "4.5".
>>>
>>> The most recent observation (release) ex-post is:  [1]
>>> Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
>>> Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5
>>>
>>> Since the data is not revised further than one month back, the whole
>>> series ex-post would look like this: [3]
>>> Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
>>> Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5
>>>
>>> Unemployment; release: 2011/11/01; reporting: 2011/09/01; value: 3.5
>>>
>>> Unemployment; release: 2011/10/01; reporting: 2011/08/01; value: 4.5
>>>
>>> Unemployment; release: 2011/09/01; reporting: 2011/07/01; value: 5.5
>>>
>>> Whereas the "known-to-market" series would look like this: [2]
>>>
>>> Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
>>> Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
>>> Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
>>> Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4
>>>
>>> Those are the series I want to get from the db.
>>>
>>>
>>> How would you model this with HBase? Is HBase suitable for that
>>> application? Or, in general, a column-oriented DB?
>>>
>>> Or is a relational approach a better fit?
>>>
>>>
>>> Thanks!
>>>
>> The opinions expressed here are mine, while they may reflect a
>> cognitive thought, that is purely accidental.
>> Use at your own risk.
>> Michael Segel
>> michael_segel (AT) hotmail.com


-- 
Ulrich Staudinger, Managing Director and Sr. Software Engineer, ActiveQuant
GmbH

P: +41 79 702 05 95
E: ustaudinger@activequant.com

http://www.activequant.com

AQ-R user? Join our mailing list:
http://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/aqr-user
