hive-user mailing list archives

From Guy Doulberg <>
Subject Synching HDFS directories with partitions on the Hive.
Date Thu, 07 Apr 2011 06:45:42 GMT
Hey folks,

I wanted to consult with you on something that has been bothering me for a while...

I have declared external tables, partitioned by date_hour. A batch
Hadoop process updates the files under the partitions, and I want the data to be accessible
via Hive as soon as it is updated.

I came up with 3 solutions, each with its own problem:
1. Create all partitions a month in advance. This leaves empty directories on HDFS
for the future partitions, and as a result using ">" might fail the job, since it
loads empty file input.
2. Have the batch notify Hive that a new partition has been added when it finishes -
but then the batch needs to "know" about Hive in order to update it, and I want the batch
to be agnostic towards Hive.
3. Run a crontab process that scans HDFS for all the partitions available there, lists
all the partitions declared in Hive, finds the delta, and adds the partitions in
the delta.
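For what it's worth, option 3 can be sketched roughly as below. This is only a sketch under assumptions: the table name `events`, the path `/data/events/<date_hour>`, and the file names are all hypothetical, and here the two partition lists are faked with static files so the delta logic itself is visible; a real cron job would produce them from `hadoop fs -ls` and `SHOW PARTITIONS`.

```shell
# Sketch of option 3 (HDFS/Hive partition delta). Hypothetical table "events",
# hypothetical layout /data/events/<date_hour>.
# In a real cron job the two sorted lists would come from, e.g.:
#   hadoop fs -ls /data/events | awk -F/ '{print $NF}' | sort > hdfs_parts.txt
#   hive -e "SHOW PARTITIONS events" | sed 's/^date_hour=//' | sort > hive_parts.txt
# Faked here with static files for illustration:
printf '2011-04-06_22\n2011-04-06_23\n2011-04-07_00\n' > hdfs_parts.txt
printf '2011-04-06_22\n2011-04-06_23\n' > hive_parts.txt

# comm -23 keeps lines present only in the first (HDFS) list: the delta.
comm -23 hdfs_parts.txt hive_parts.txt |
while read part; do
  echo "ALTER TABLE events ADD PARTITION (date_hour='$part') LOCATION '/data/events/$part';"
done > add_partitions.hql

cat add_partitions.hql
# A real job would then run: hive -f add_partitions.hql
```

Depending on your Hive version, `MSCK REPAIR TABLE events` may automate much of this sync, though I'm not sure it covers the one-partition-per-directory layout here.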

Do you have other solutions?
Or improvements?

