falcon-dev mailing list archives

From "Shwetha G S (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FALCON-630) late data rerun for process broken in trunk
Date Mon, 25 Aug 2014 05:39:57 GMT

    [ https://issues.apache.org/jira/browse/FALCON-630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108764#comment-14108764 ]

Shwetha G S commented on FALCON-630:
------------------------------------

{quote}
Why do you need this? How is this different from feedNames? This same property can be overloaded
with input names in the process and feed names in replication, no?
{quote}
An input name is different from a feed name: Falcon validates that input names are unique
within a process, but not that input feed names are. This matters for a lot of pipelines where
different instances of the same feed are handled differently. For example, de-duping of events
across 2 hours takes the (n-1)th and (n)th hour data of the same input feed and de-dupes every
event in the (n-1)th hour against the (n)th hour events. Such a process needs to define
2 inputs for the same feed. This is why late data is defined on input names, rather than on
input feeds, in a process.
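
To make the distinction concrete, here is a minimal sketch in Python (not Falcon code). The
input names prevHour and currHour and the feed name events are made up for illustration; only
the rule itself (input names unique, feed names allowed to repeat) comes from the explanation
above.

```python
from collections import Counter

# Hypothetical inputs of a de-duping process: two inputs that read
# different instances of the same feed. All names are illustrative.
inputs = [
    {"name": "prevHour", "feed": "events"},  # (n-1)th hour instance
    {"name": "currHour", "feed": "events"},  # (n)th hour instance
]


def duplicates(values):
    """Return the values that occur more than once, sorted."""
    return sorted(v for v, count in Counter(values).items() if count > 1)


# Input names must be unique, so this list is expected to be empty.
duplicate_names = duplicates(i["name"] for i in inputs)

# Feed names may legitimately repeat across inputs.
duplicate_feeds = duplicates(i["feed"] for i in inputs)

print("duplicate input names:", duplicate_names)
print("duplicate feed names:", duplicate_feeds)
```

Because the feed name alone cannot tell the two inputs apart, anything keyed by feed name
would be ambiguous for a process like this one.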

Currently, late data handling for a process is broken because the workflow param carries feed
names, while the late data section of the process lists input names. So the comparison in the
code is between two different kinds of names and is wrong.
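
The mismatch can be sketched in a few lines of Python. This is an illustration only, not the
Falcon code or the attached patch. The param value is the one from the issue description,
assuming '#' separates the entries; the input names (rrLog, prLog, nprLog) and their mapping
to feeds are hypothetical.

```python
# Value of the falconInputFeeds workflow param as reported in the issue.
# Assumption: '#' separates the individual feed names.
falcon_input_feeds = "FETL2-RRLog#FETL-RTBS-PRLog#FETL-RTBS-NPRLog"

# Hypothetical process definition: input name -> feed name.
inputs = {
    "rrLog": "FETL2-RRLog",
    "prLog": "FETL-RTBS-PRLog",
    "nprLog": "FETL-RTBS-NPRLog",
}

# Hypothetical late data section: it refers to inputs by input name.
late_inputs = ["rrLog", "prLog"]

# Names under which the workflow recorded its data (feed names).
recorded = set(falcon_input_feeds.split("#"))

# Looking up input names among feed names finds nothing, so late data
# would never be detected.
matched_by_name = [name for name in late_inputs if name in recorded]

# Shown only for contrast: translating each input name to its feed
# first makes the lookup succeed for these sample names.
matched_by_feed = [name for name in late_inputs if inputs[name] in recorded]

print("matched by input name:", matched_by_name)
print("matched via feed name:", matched_by_feed)
```

The second lookup is there only to show where the two name spaces diverge. It would still be
ambiguous for a process that defines two inputs on the same feed.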

> late data rerun for process broken in trunk 
> --------------------------------------------
>
>                 Key: FALCON-630
>                 URL: https://issues.apache.org/jira/browse/FALCON-630
>             Project: Falcon
>          Issue Type: Bug
>          Components: rerun
>    Affects Versions: 0.5
>            Reporter: Samarth Gupta
>            Assignee: Shwetha G S
>            Priority: Blocker
>             Fix For: 0.4
>
>         Attachments: FALCON-630.patch
>
>
> Late data rerun for process is not working. It seems that in pre-processing the record size
> is stored by feed name and not by input name, due to which late data is never detected.

> {code}
>                     -falconInputFeeds
>                     FETL2-RRLog#FETL-RTBS-PRLog#FETL-RTBS-NPRLog
> {code}
> Above, even though the param in the tasktracker logs says InputFeeds, the values are
> actually feed names.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
