hadoop-common-user mailing list archives

From Robert Zubek <rob...@threerings.net>
Subject Re: Storing/retrieving time series with hadoop
Date Mon, 12 Jan 2009 20:11:02 GMT
We use Hadoop to warehouse time series data and run analytics on it.

Being able to parallelize our analytics jobs, and scale up the cluster 
as needed for the data, turned out to be a big win.

However, we rolled our own storage solution. When we started this 
project, there were no good solutions for storing time series (there 
may be now). I investigated HBase, but it was optimized for retrieving 
just the latest values, not the entire time series for analysis. We 
also investigated Pig, but it was too early in that project's life, and 
it didn't support everything we wanted.

As for latency - with S3 it can be significant, depending on how you lay 
out your data; we have a separate caching layer just to speed up data 
retrieval for graph drawing. I haven't tried HDFS over clustered hard 
drives, though; it might be fast enough for your purposes.
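
To make the layout and caching point concrete, here is a minimal Python sketch. It assumes one stored object per series per day and a small in-process cache in front of the slow store. The key scheme, the file extension, and every function name are hypothetical and chosen only for illustration; this is not the storage or caching code described above.

```python
from datetime import datetime, timedelta
from functools import lru_cache

FETCHES = []  # records every read that reaches the slow store (for illustration)


def day_keys(series, start, end):
    """Return one hypothetical object key per day between start and end, inclusive."""
    day = start.date()
    keys = []
    while day <= end.date():
        # Assumed layout: <series>/<YYYY>/<MM>/<DD>.tsv
        keys.append(f"{series}/{day:%Y/%m/%d}.tsv")
        day += timedelta(days=1)
    return keys


def fetch_from_store(key):
    """Stand-in for a slow read (for example from S3); returns no points here."""
    FETCHES.append(key)
    return []


@lru_cache(maxsize=1024)
def cached_fetch(key):
    """Read-through cache: only the first request for a key hits the store."""
    return tuple(fetch_from_store(key))


def read_range(series, start, end):
    """Collect all points for a chart covering start..end, one day object at a time."""
    points = []
    for key in day_keys(series, start, end):
        points.extend(cached_fetch(key))
    return points
```

With a layout like this, a chart over three days touches three objects, and redrawing the same chart is served from the cache rather than the store. Whether per-day objects are the right granularity depends on data volume and query patterns.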


Brock Judkins wrote:
> Hi list,
> I am researching hadoop as a possible solution for my company's data
> warehousing solution. My question is whether hadoop, possibly in combination
> with Hive or Pig, is a good solution for time-series data? We basically have
> a ton of web analytics to store that we display both internally and
> externally.
> For the time being I am storing timestamped data points in a huge MySQL
> table, but I know this will not scale very far (although it's holding up ok
> at almost 90MM rows). I am aware that hadoop can scale insanely large
> (larger than I need), but does anyone have experience using it to draw
> charts based on time series with fairly low latency?
> Thanks!
> Brock
