oozie-user mailing list archives

From "Harshal Vora" <harshal.v...@komli.com>
Subject Versioning of file during re-run
Date Wed, 23 Nov 2011 14:09:42 GMT


We have a job for processing logs. Multiple log servers dump files into HDFS, and
MapReduce jobs process these files.
This process runs every half hour.
Sometimes one of the log servers is down and a few files are missing. When that
happens, we go ahead and process whatever files are available. But when the
missing files become available, say 5 hours later, we want to re-run all the jobs that ran
during those 5 hours.

We want to do this because each instance's output depends on the output of the previous
instance of the job: we keep a running count both within and across time intervals.

From what I understand, I will have to re-run each coordinator or bundle instance within
the last 5 hours. At the same time, I will have to stop any new instances from running until
the files from the last 5 hours are processed and the re-runs have caught up with all new files.
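
To make the size of such a re-run concrete, here is a rough sketch that lists the nominal times of the half-hourly instances falling in the 5 hours before the late files arrive. The function name and parameters are hypothetical, and it assumes instances are aligned to half-hour boundaries counted from midnight; it does not call Oozie at all.

```python
from datetime import datetime, timedelta


def instances_to_rerun(late_arrival, hours_back=5, frequency_minutes=30):
    """List nominal times of the instances in the window before late_arrival.

    Assumption: instances start on multiples of the frequency counted
    from midnight (e.g. 09:00, 09:30, ...).
    """
    step = timedelta(minutes=frequency_minutes)
    midnight = datetime(late_arrival.year, late_arrival.month, late_arrival.day)
    # Round the arrival time down to the most recent instance boundary.
    aligned_end = midnight + ((late_arrival - midnight) // step) * step
    current = aligned_end - timedelta(hours=hours_back)
    times = []
    while current < aligned_end:
        times.append(current)
        current += step
    return times


if __name__ == "__main__":
    for nominal_time in instances_to_rerun(datetime(2011, 11, 23, 14, 9)):
        print(nominal_time.strftime("%Y-%m-%dT%H:%MZ"))
```

With a half-hour frequency, a 5-hour gap means 10 instances to re-run, oldest first, because each one depends on the output of the one before it.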

The issue we are facing is that, for the previous coordinator instances to re-run, we
have to delete the previous output files of those coordinator instances in HDFS, and the re-run
will produce new files. We don't want to do that. We want to have something like {path}/{timestamp}/{rev-1}
for the first run and {path}/{timestamp}/{rev-2} for the re-run. Then, when the following job looks
up coord:current(-1), it should pick up the rev-2 files.
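
The revision scheme described above could be sketched roughly as follows. This is only an illustration of the desired behaviour, not something Oozie provides: the helper names are hypothetical, the directory listing is passed in as a plain list of names rather than read from HDFS, and the "rev-N" naming is assumed from the paths above.

```python
import re

# Matches revision directory names such as "rev-1" or "rev-12".
REV_PATTERN = re.compile(r"^rev-(\d+)$")


def latest_revision(dir_names):
    """Return the highest revision number among dir_names, or None if there is none."""
    revisions = []
    for name in dir_names:
        match = REV_PATTERN.match(name)
        if match:
            revisions.append(int(match.group(1)))
    return max(revisions) if revisions else None


def output_path_for_run(base, timestamp, dir_names):
    """Path a new run (or re-run) would write to: one revision past the latest."""
    latest = latest_revision(dir_names)
    next_rev = 1 if latest is None else latest + 1
    return f"{base}/{timestamp}/rev-{next_rev}"


def input_path_for_next_job(base, timestamp, dir_names):
    """Path the following job would read: the latest existing revision, if any."""
    latest = latest_revision(dir_names)
    if latest is None:
        return None
    return f"{base}/{timestamp}/rev-{latest}"
```

Revision numbers are compared as integers, so rev-10 sorts after rev-9, and the earlier revisions stay in place instead of being deleted before the re-run.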

From what I understand, this is not possible with Oozie, i.e. there is no revisioning of
output files when a job is re-run. Or is there a way to do it?

Or is there a better approach, for example using coord:latest(-1)?

