hbase-user mailing list archives

From "Bartosz M. Frak " <bar...@bnl.gov>
Subject Rowkey design for time series data
Date Tue, 03 Jul 2012 22:22:26 GMT
Hey Guys,

Before I get to my thoughts on the rowkey design, here's some background 
info about the problem we are trying to tackle.

We are producing about 60TB of data a year (uncompressed). Most of this data is collected continuously from various detectors around our facility. The vast majority of it is numerical (either scalars or 1-dimensional arrays). Detectors can either be polled at regular intervals or send their measurements asynchronously. We are currently using regular compressed files to store all this information, with headers (metadata) stored in a relational database. Managing (moving, archiving) this amount of data is slowly becoming a nuisance, so we are investigating other solutions, including distributed databases like HBase.

We have a number of requirements based on our users' access patterns - the most important one being (ridiculously) fast sequential, time-sorted access to selected metric(s). We also want to decimate the data for live viewing (visually compressing billions of time series points into a manageable size without losing the "shape" of the original dataset). We are already doing this in our custom middleware server, but it looks like a problem that could be tackled by MapReduce.
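
To make the decimation idea concrete, here is a rough sketch (plain Java, all names made up - this is not what our middleware actually does) of the kind of min/max-per-bucket reduction I have in mind, which keeps the visual "shape" of a series while dropping most of its points:

import java.util.ArrayList;
import java.util.List;

public class Decimator {

    public static class Point {
        public final long ts;      // timestamp, e.g. epoch seconds
        public final double value; // scalar sample
        public Point(long ts, double value) { this.ts = ts; this.value = value; }
    }

    /** Reduce a time-sorted series to at most 2 * buckets points (min and max per time bucket). */
    public static List<Point> decimate(List<Point> sorted, int buckets) {
        if (sorted.size() <= 2 * buckets) {
            return sorted; // already small enough
        }
        long start = sorted.get(0).ts;
        long end = sorted.get(sorted.size() - 1).ts;
        long width = Math.max(1, (end - start + buckets) / buckets); // bucket width in time units

        List<Point> out = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            // end of the time bucket the current point falls into
            long bucketEnd = start + width * ((sorted.get(i).ts - start) / width + 1);
            Point min = sorted.get(i), max = sorted.get(i);
            while (i < sorted.size() && sorted.get(i).ts < bucketEnd) {
                Point p = sorted.get(i);
                if (p.value < min.value) min = p;
                if (p.value > max.value) max = p;
                i++;
            }
            // emit the bucket's extremes in time order
            if (min.ts <= max.ts) {
                out.add(min);
                if (max != min) out.add(max);
            } else {
                out.add(max);
                out.add(min);
            }
        }
        return out;
    }
}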

Now, back to the subject matter. I examined both OpenTSDB (http://opentsdb.net/schema.html) and HBaseWD, and although their general ideas look right, neither one of them looks like a perfect fit for us. The former correctly distributes writes across the entire cluster, but the sequential, time-ordered reads for the same metric end up being localized to a relatively small number of regions (with enough data collected over a long period of time the reads will hit just one, maybe two regions). The latter also distributes the writes across the entire cluster, but the reads require BUCKET_COUNT (BC) scans, and the results are almost guaranteed to be out of order across the buckets (they are in the correct relative order within each bucket).
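
To illustrate that second point, here is a sketch (plain Java; the Row type and the per-bucket iterators are hypothetical stand-ins for real scanner results) of the client-side k-way merge an HBaseWD-style read would need in order to restore global time order across the BC scans:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class BucketMerge {

    public static class Row {
        public final long ts;      // timestamp extracted from the rowkey
        public final byte[] value;
        public Row(long ts, byte[] value) { this.ts = ts; this.value = value; }
    }

    // heap entry: the current row of one bucket scan plus the rest of that scan
    private static final class Entry {
        final Row row;
        final Iterator<Row> rest;
        Entry(Row row, Iterator<Row> rest) { this.row = row; this.rest = rest; }
    }

    /** Merge BC streams, each time-sorted within its bucket, into one time-sorted list. */
    public static List<Row> mergeBuckets(List<Iterator<Row>> bucketScans) {
        PriorityQueue<Entry> heap =
                new PriorityQueue<>(Comparator.comparingLong((Entry e) -> e.row.ts));
        for (Iterator<Row> scan : bucketScans) {
            if (scan.hasNext()) heap.add(new Entry(scan.next(), scan));
        }
        List<Row> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Entry e = heap.poll();
            merged.add(e.row);
            if (e.rest.hasNext()) heap.add(new Entry(e.rest.next(), e.rest));
        }
        return merged;
    }
}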

I was thinking about a rowkey design which takes another time dimension into consideration (the timestamp, or some form of it, still ends up after the metric name itself) - for example, the hour of the day. This value, ranging from 1 to 24, would be prefixed to each rowkey (i.e. 7-metric.name-1349585333). On its own this is obviously a terrible design, because a handful of regions would end up being overloaded depending on the hour of the day. However, we can use the metric name's hash modded with the bucket count (24 in this case) to come up with a new starting prefix base for each metric, then add the real hour of the day to that base and subtract BC if the value is greater than BC.

This way all writes are still distributed evenly across the system ... and so are the reads, assuming we are reading more than one hour's worth of data, which is almost always true in our case. We still have to do BC scans if reading 24+ hours of data, but the data within and across the buckets is always correctly time sorted. We can also limit the scan count based on the selected time range (i.e. if someone asks for data for a given metric between 7am and 10am, we only have to do 3 scans for those three full hours).
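
Here is a rough sketch of the keying scheme itself (plain Java, string keys for readability; it uses 0-based hours and buckets instead of the 1-24 range above, and a real schema would probably use fixed-width binary key components):

import java.nio.charset.StandardCharsets;

public class HourBucketKey {

    static final int BUCKET_COUNT = 24; // one bucket per hour of day

    /** Stable per-metric base prefix in [0, BUCKET_COUNT). */
    static int metricBase(String metric) {
        return Math.floorMod(metric.hashCode(), BUCKET_COUNT);
    }

    /** Bucket prefix for a sample taken at the given epoch second. */
    static int bucketFor(String metric, long epochSeconds) {
        int hourOfDay = (int) ((epochSeconds / 3600) % 24); // 0-23 (UTC)
        int bucket = metricBase(metric) + hourOfDay;
        if (bucket >= BUCKET_COUNT) bucket -= BUCKET_COUNT; // wrap back into range
        return bucket;
    }

    /** Rowkey of the form "<bucket>-<metric>-<timestamp>". */
    static byte[] rowkey(String metric, long epochSeconds) {
        String key = bucketFor(metric, epochSeconds) + "-" + metric + "-" + epochSeconds;
        return key.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A read for one metric between 7am and 10am only needs the buckets
        // covering those three hours, i.e. 3 scans instead of BUCKET_COUNT.
        System.out.println(new String(rowkey("metric.name", 1349585333L), StandardCharsets.UTF_8));
    }
}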

I'm a complete newb when it comes to distributed databases, so if I'm 
way off on this please set me straight.

