hive-user mailing list archives

From Terje Marthinussen
Subject Scheduling jobs in hive
Date Thu, 28 Oct 2010 04:15:43 GMT

Are there any good scheduling tools out there suitable for the dependencies
you may get in Hive?

Specific example I have right now:
- 2 tables with event logs from different sources
- 1 table with some additional data from a different source, but this data
is a daily summary

None of this data is streamed in real time. It is copied in, and arrival can
be highly asynchronous and even out of order (I may get a summary for Tuesday
before the one for Monday).

I need to join data from these 3 tables to generate daily statistics but
obviously, I do not want to reprocess everything every day and it would be
good not to run queries unless all the data is actually there.
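
A minimal sketch of that readiness check, in Python. It assumes the set of
days loaded for each source is already known (for example from partition
listings) and that processed days are recorded somewhere. The source names
and the ready_days helper are hypothetical, not part of Hive.

```python
from datetime import date


def ready_days(available, processed):
    """Return, in date order, the days that every source has data for
    and that have not been processed yet."""
    if not available:
        return []
    # A day is complete only if it appears in every source's set.
    complete = set.intersection(*available.values())
    return sorted(complete - processed)


# Hypothetical state: Tuesday's summary arrived before Monday's.
available = {
    "events_a": {date(2010, 10, 25), date(2010, 10, 26)},
    "events_b": {date(2010, 10, 25), date(2010, 10, 26)},
    "daily_summary": {date(2010, 10, 26)},
}
processed = set()

# Only Tuesday is complete, so only Tuesday would be joined.
print(ready_days(available, processed))
```

Because the check works on sets of days rather than on a "latest day" marker,
out-of-order arrival is handled naturally: Monday becomes ready whenever its
summary finally shows up, and days already processed are skipped.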

This is not that hard to solve with code written for this specific case,
but I have a hunch that I should be able to generalize it into a more
generic job dependency scheduler. However, I feel a bit like I am
staring at the forest and cannot see a single tree at the moment :)

Just cannot see a solution that I like and I have a clear feeling there
should be a better way to do it than I can think of.

Any good ideas?

