hbase-user mailing list archives

From Bradford Cross <bradford.n.cr...@gmail.com>
Subject Re: financial time series database
Date Thu, 02 Apr 2009 15:41:32 GMT
Cool, so the schema I am leaning toward is:

- Hijack the timestamp to be the time of each observation.  Use one column
family to hold all the data, and one column for each property of each
observation.
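
A minimal sketch of that layout, using a plain Python dictionary as a stand-in for an HBase table. This is not the HBase client API; the `put_observation` helper, the family name `d`, and the `IBM` sample values are all hypothetical and only illustrate where each piece of an observation would live.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for one table (NOT the HBase client API).
# Layout: row key (ticker) -> "family:qualifier" -> {timestamp_ms: value}
table = defaultdict(lambda: defaultdict(dict))

def to_millis(dt):
    """Convert an aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)

def put_observation(ticker, observed_at, properties, family="d"):
    """Store each property in its own column; the cell version is the observation time."""
    ts = to_millis(observed_at)
    for qualifier, value in properties.items():
        table[ticker][f"{family}:{qualifier}"][ts] = value
    return ts

# Sample values are made up for illustration.
day1 = datetime(2009, 1, 17, tzinfo=timezone.utc)
day2 = datetime(2009, 1, 18, tzinfo=timezone.utc)
ts1 = put_observation("IBM", day1, {"open": "100", "close": "101"})
ts2 = put_observation("IBM", day2, {"open": "101", "close": "103"})
```

Each ticker is a single row, and every observation adds one more version to each property column rather than a new row.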

Since HBase sorts the timestamps descending, it seems like hijacking the
timestamps makes sense.  Any performance implications of this that I should
be aware of?
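
To make the ordering point concrete, here is a toy model of one cell whose versions are kept newest first, with a time-range read over them. It is only a sketch of the access pattern under my own assumptions: the `VersionedCell` class is hypothetical, the half-open range is a choice made for this example, and nothing here reflects how HBase actually stores or scans versions.

```python
import bisect

class VersionedCell:
    """Toy model of one cell's versions, kept in descending timestamp order."""

    def __init__(self):
        # Negated timestamps sorted ascending, i.e. real timestamps descending.
        self._neg_ts = []
        self._values = []

    def put(self, ts, value):
        """Insert a version, or overwrite the value if the timestamp already exists."""
        key = -ts
        i = bisect.bisect_left(self._neg_ts, key)
        if i < len(self._neg_ts) and self._neg_ts[i] == key:
            self._values[i] = value
        else:
            self._neg_ts.insert(i, key)
            self._values.insert(i, value)

    def scan(self, start_ts, end_ts):
        """Return (ts, value) pairs with start_ts <= ts < end_ts, newest first."""
        lo = bisect.bisect_right(self._neg_ts, -end_ts)    # skips ts >= end_ts
        hi = bisect.bisect_right(self._neg_ts, -start_ts)  # keeps ts >= start_ts
        return [(-k, v) for k, v in zip(self._neg_ts[lo:hi], self._values[lo:hi])]

cell = VersionedCell()
for ts, bid in [(3, "100.03"), (1, "100.01"), (2, "100.02")]:
    cell.put(ts, bid)

recent = cell.scan(1, 3)  # versions at ts 2 and ts 1, newest first
```

Because the versions are already sorted, a date-range read is two binary searches plus a slice, and the most recent observation comes back first.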

Hijacking the timestamps seems fairly intuitive, and it puts to use
timestamps that I would otherwise not care about if I simply ignored them
and dumped all the data, including the date/time of each observation, into
columns.

Are there any downsides to hijacking the timestamps like this?



On Thu, Apr 2, 2009 at 12:13 AM, stack <stack@duboce.net> wrote:

> I should also state that apart from the hbase inadequacy, your schema looks
> good (hbase should be able to carry this schema-type w/o sweat -- hopefully
> 0.20.0).
> St.Ack
>
> On Thu, Apr 2, 2009 at 9:12 AM, stack <stack@duboce.net> wrote:
>
> > How many columns will you have?  Until we fix
> > https://issues.apache.org/jira/browse/HBASE-867, you are limited in
> > the number of columns you can have.
> > St.Ack
> >
> >
> > On Thu, Apr 2, 2009 at 4:48 AM, Bradford Cross
> > <bradford.n.cross@gmail.com> wrote:
> >
> >> Based on reading the HBase architecture wiki, I have changed my thinking
> >> due to the "Column Family Centric Storage."
> >>
> >> HBase stores column families physically close on disk, so the items in a
> >> given column family should have roughly the same read/write
> >> characteristics and contain similar data.  Although at a conceptual level,
> >> tables may be viewed as a sparse set of rows, physically they are stored
> >> on a per-column family basis. This is an important consideration for
> >> schema and application designers to keep in mind.
> >>
> >> This leads me to the thought of keeping an entire time series inside a
> >> single column family.
> >>
> >> Options:
> >>
> >> Row key is a ticker symbol:
> >> - Hijack the timestamp to be the time of each observation.  Use a column
> >> family to hold all the data, and a column for each property of each
> >> observation.
> >> - Don't hijack the timestamp, just ignore it.  Use a column family for
> >> all the data, an individual column for the date/time of the observation,
> >> and individual columns for each property of each observation.
> >>
> >> thoughts?
> >>
> >> On Tue, Mar 31, 2009 at 7:25 PM, Bradford Cross
> >> <bradford.n.cross@gmail.com> wrote:
> >>
> >> > Greetings,
> >> >
> >> > I am prototyping a financial time series database on top of HBase and
> >> > trying to wrap my head around what a good design would look like.
> >> >
> >> > As I understand it, I have rows, column families, columns and cells.
> >> >
> >> > Since the only thing that HBase really "indexes" is row keys, it seems
> >> > natural in a way to represent the rowkeys as the date/time.
> >> >
> >> > As a simple example:
> >> >
> >> > Bar data:
> >> >
> >> > {
> >> >    "2009/1/17" : {
> >> >      "open":"100",
> >> >      "high":"102",
> >> >      "low":"99",
> >> >      "close":"101",
> >> >      "volume":"1000256"
> >> >    }
> >> > }
> >> >
> >> >
> >> > Quote data:
> >> >
> >> > {
> >> >    "2009/1/17:11:23:04" : {
> >> >      "bid":"100.01",
> >> >      "ask":"100.02",
> >> >      "bidsize":"10000",
> >> >      "asksize":"100200"
> >> >    }
> >> > }
> >> >
> >> > But there are many other issues to think about.
> >> >
> >> > In financial time series data we have small amounts of data within each
> >> > "observation" and we can have lots of observations.  We can have
> >> > millions of observations per time series (e.g. all historical trade and
> >> > quote data for a particular stock since 1993) across hundreds of
> >> > thousands of individual instruments (e.g. across all stocks that have
> >> > traded since 1993).
> >> >
> >> > The write patterns fit HBase nicely, because it is a write-once-and-append
> >> > pattern.  This is followed by loads of offline processes for simulating
> >> > trading models and such.  These query patterns look like "all quotes for
> >> > all stocks between the dates of 1/1/1996 and 12/31/2008."  So the
> >> > querying is typically across a date range, and we can further filter the
> >> > query by instrument types.
> >> >
> >> > So I am not sure what makes sense for efficiency because I do not
> >> > understand HBase well enough yet.
> >> >
> >> >  What kinds of mixes of rows, column families, and columns should I be
> >> > thinking about?
> >> >
> >> > Does my simplistic approach make any sense?  That would mean each row is
> >> > a key-value pair where the key is the date/time and the value is the
> >> > "observation."  I suppose this leads to a "table per time series" model.
> >> > Does that make sense or is there overhead to having lots of tables?
> >> >
> >>
> >
> >
>
