hive-user mailing list archives

From Ashish Thusoo <>
Subject RE: Hive for RRA Data
Date Wed, 01 Apr 2009 13:55:55 GMT

I think you have too little data. With this approach you will end up with lots of small files,
and the fragmentation itself (considering that you will not be able to fill up even one block
per partition) will kill the performance on the cluster.

There are 2 things that you could do to alleviate this problem to some extent:

1. You could just slap an external table on a directory that contains the data in hdfs,
and you could use hadoop appends (appends have been there since 0.18 - I think) to update a file
in this directory that holds all this data. You can then use Hive as usual on this data.
2. You can change your loading process to run a hive union all to create a file that has the
merged data and then drop and recreate the original table. We could add a swap command to
hive (similar to what oracle does with swap partitions) to make this better too...
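
The union-all step in option 2 can be sketched as a string-building helper, in the same style as the CREATE TABLE code later in this thread. The table names "metrics", "metrics_new", and "metrics_merged" are hypothetical placeholders, not anything from this thread or from Hive itself:

```java
// Sketch of option 2: build the HiveQL that merges the existing table with
// the newly loaded rows via UNION ALL. The result would then replace the
// original table (drop and recreate, or a future swap command).
// Table names here are hypothetical placeholders.
public class MergeHqlSketch {
    static String buildMergeHql(String target, String existing, String incoming) {
        StringBuilder hql = new StringBuilder();
        hql.append("INSERT OVERWRITE TABLE ").append(target);
        // Hive requires UNION ALL to sit inside an aliased subquery.
        hql.append(" SELECT u.* FROM (");
        hql.append(" SELECT * FROM ").append(existing);
        hql.append(" UNION ALL ");
        hql.append(" SELECT * FROM ").append(incoming);
        hql.append(" ) u");
        return hql.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildMergeHql("metrics_merged", "metrics", "metrics_new"));
    }
}
```

Running the generated statement nightly produces one large file per load instead of many tiny ones.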

But the key is to keep your blocks full; otherwise, fragmentation will really slow down your
cluster.

From: Edward Capriolo []
Sent: Tuesday, March 31, 2009 9:06 AM
Subject: Re: Hive for RRA Data

I made some progress with this. Rather than going with the
anti-pattern of storing a column name as a column, I decided to create
a table for each 'cacti data template' type. After extracting the
columns from the RRD file I do:

    hql.append("CREATE TABLE IF NOT EXISTS " + schema.getTableName() + " (");
    hql.append(" row_time BIGINT, ");
    Iterator<String> i = schema.getColumns().iterator();
    while (i.hasNext()) {
      hql.append(" " + i.next() + " DOUBLE ");
      if (i.hasNext()) {
        hql.append(" , ");
      }
    }
    hql.append(" ) ");
    hql.append(" PARTITIONED BY ( day STRING, data_template_data_id INT) ");
    hql.append(" ROW FORMAT DELIMITED ");
    hql.append(" FIELDS TERMINATED BY '\\054' ");
    hql.append(" LINES TERMINATED BY '\\012' ");
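
A minimal, self-contained sketch of just the column-clause loop, assuming a plain List<String> stands in for the schema object (which is specific to my code):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ColumnClauseSketch {
    // Builds the "<col> DOUBLE , <col> DOUBLE" fragment of the CREATE TABLE,
    // putting a comma between columns but not after the last one.
    static String columnClause(List<String> columns) {
        StringBuilder hql = new StringBuilder();
        Iterator<String> i = columns.iterator();
        while (i.hasNext()) {
            hql.append(" ").append(i.next()).append(" DOUBLE ");
            if (i.hasNext()) {
                hql.append(" , ");
            }
        }
        return hql.toString();
    }

    public static void main(String[] args) {
        // For the hard-drive example below this yields the two DOUBLE columns.
        System.out.println(columnClause(Arrays.asList("hdd_used", "hdd_free")));
    }
}
```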

My data looks like: (without the column header)

file: server1_harddrive.rrd

time, hdd_used, hdd_free
1234840, 56,90
1235840, 54,92

I write the data to hadoop then I use LOAD DATA INPATH.
The Good news:
I am partitioned by day and data_template_data_id (data id). This
makes a query for specific data fast, and easy to count and group.
The Bad news:
Each partition/RRD file is about 4KB-8KB. I have about 400 devices,
and 9000 data sources. My block size is 128 MB, so data-per-block is poor.

I was thinking of rolling this data up into months, but after doing some
high level math, even a year's worth of data would only be about 2MB.
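
The back-of-envelope math, taking 6 KB as a midpoint of the 4KB-8KB daily partition size, shows a year of one data source is roughly 2 MB, under 2% of a 128 MB block:

```java
public class BlockUtilization {
    public static void main(String[] args) {
        long partitionBytes = 6 * 1024;        // midpoint of the 4-8 KB daily files
        long blockBytes = 128L * 1024 * 1024;  // 128 MB HDFS block size
        long yearBytes = partitionBytes * 365; // one year of daily partitions

        System.out.printf("One year of data: %d KB%n", yearBytes / 1024);
        System.out.printf("Block utilization: %.2f%%%n", 100.0 * yearBytes / blockBytes);
    }
}
```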

I considered making 'data_template_data_id' a column rather than a
partition, but that has the same result really. For my deployment, each
of the 300 devices has one or two hard drives. One day's data for all
three hundred hard drives still does not fully utilize a block. (If you
had 3000 or 30000 devices you might be able to utilize the block.)

I am considering no partitions. Now the 'no append' in hadoop is an
issue. My process is going to be kicked off nightly, resulting in
daily files, more or less the same as partitioning by day. Can hive
merge those files automatically?

Any ideas?
