falcon-dev mailing list archives

From "Ajay Yadava (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FALCON-1686) Support for reprocessing
Date Wed, 23 Dec 2015 11:44:46 GMT

    [ https://issues.apache.org/jira/browse/FALCON-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069538#comment-15069538 ]

Ajay Yadava commented on FALCON-1686:
-------------------------------------

I think this is the use case I was talking about: you want to reprocess the instances
that were already processed by the old code. This should be solved by the *effective time
update* feature.

From what I understand, Srikanth Sundarrajan is talking about another use case: you run
your process and figure out that the code is correct, but you have missed some instances
because of an incorrect start date. In this case you want *new instances* for a time range
earlier than the start time, and you need to update the start date of your process to an
earlier time. This is also a valid use case, but it won't be solved by the effective time
update feature.
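
To make the distinction between the two use cases concrete, here is a minimal sketch in
Python. It is not Falcon code: the function name and the returned labels are hypothetical
and only illustrate how a requested date range relates to a process's start time.

```python
from datetime import datetime


def classify_range(process_start, range_start, range_end):
    """Classify a requested range [range_start, range_end) against the process start.

    The labels are hypothetical, chosen for illustration only:
      "reprocess"     - every instance in the range already exists (re-run case)
      "new-instances" - the whole range lies before the process start
      "mixed"         - the range straddles the process start
    """
    if range_start >= range_end:
        raise ValueError("range_start must be earlier than range_end")
    if range_end <= process_start:
        return "new-instances"
    if range_start >= process_start:
        return "reprocess"
    return "mixed"


# Example: a process whose start date is 2015-06-01.
start = datetime(2015, 6, 1)
kind = classify_range(start, datetime(2015, 7, 1), datetime(2015, 8, 1))
```

Under these assumptions, only the "reprocess" case corresponds to what effective time
update is meant to address; the "new-instances" case requires moving the start date
of the process to an earlier time.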

> Support for reprocessing
> ------------------------
>
>                 Key: FALCON-1686
>                 URL: https://issues.apache.org/jira/browse/FALCON-1686
>             Project: Falcon
>          Issue Type: Improvement
>    Affects Versions: 0.7
>            Reporter: Mass Dosage
>
> We have a number of ETL jobs which we schedule to run on a regular basis with Falcon.
> This works fine. However, we often have cases where we need to run the exact same jobs over
> past date ranges in order to reprocess data after a code change. There doesn't seem to be
> any easy way to do this in Falcon at the moment. Ideally we'd have a controlled way of saying
> "run this process for dates between X and Y". There should also be a way to control whether
> downstream processes are triggered by the data being reprocessed or not. In some cases you
> may want downstream jobs to also run on the new data but in other cases you might not.
> With Oozie, if one wants to reprocess data from any time in history, one can update the
> start & end-dates (using the job.properties file) and submit a new coordinator to run
> alongside the existing one. As the coordinator-ids are unique they do not clash. In Falcon,
> processes are defined by their readable name so one would need to update that in the process
> file directly.
> We are currently working around this issue by making a copy of the original Falcon process,
> giving it a different name and changing the dates. This isn't ideal and leads to a lot of
> XML duplication.
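
For illustration, the copy-and-rename workaround described above could be scripted roughly
as follows. This is a sketch under assumptions, not a Falcon feature: the function name is
hypothetical, and it assumes the process definition carries its name in the root element's
"name" attribute and its dates in "validity" elements with "start" and "end" attributes.

```python
import xml.etree.ElementTree as ET


def clone_process_for_backfill(xml_text, new_name, start, end):
    """Return a copy of a process definition with a new name and new dates.

    Hypothetical helper: it only rewrites attributes in the XML text and does
    not talk to Falcon. Submitting the result is left to the caller.
    """
    root = ET.fromstring(xml_text)
    # Assumption: the process name is the "name" attribute on the root element.
    root.set("name", new_name)
    for elem in root.iter():
        # Match on the local tag name so a default XML namespace does not matter.
        if elem.tag.rsplit("}", 1)[-1] == "validity":
            elem.set("start", start)
            elem.set("end", end)
    return ET.tostring(root, encoding="unicode")
```

A script like this avoids editing the copy by hand, but it still produces a second,
near-identical process definition, so the duplication concern raised above remains.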



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
