falcon-dev mailing list archives

From "Balu Vellanki (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FALCON-1728) Process entity definition allows multiple clusters when it has output Feed defined.
Date Wed, 06 Jan 2016 21:58:39 GMT

    [ https://issues.apache.org/jira/browse/FALCON-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086358#comment-15086358 ]

Balu Vellanki commented on FALCON-1728:

[~pavan kumar] and [~ajayyadava] : Say you have an output feed FeedOne whose source cluster
is ClusterOne and target cluster is ClusterTwo. The location of the feed is /apps/falcon/feedOne/location/{YEAR}-{MONTH}-{DAY}-{HOUR}

Now say you have a process ProcessOne whose output feed is FeedOne. The process runs on clusters
ClusterTwo and ClusterThree. When Oozie runs a process instance, the user expects the output
data to be generated in ClusterOne/apps/falcon/feedOne/location/{YEAR}-{MONTH}-{DAY}-{HOUR}.
The user also expects this dir to be replicated to ClusterTwo/apps/falcon/feedOne/location/{YEAR}-{MONTH}-{DAY}-{HOUR}.
Now, if two jobs for the same process instance on two different clusters are writing to the
same dir ClusterOne/apps/falcon/feedOne/location/{YEAR}-{MONTH}-{DAY}-{HOUR}, won't this be
a problem?
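
The scenario above could be sketched with entity definitions roughly like the following. This is an illustrative approximation only: the validity windows, schema details, and output instance expression are made up, not taken from actual entities.

```xml
<!-- Illustrative sketch only: validity windows and schema details are made up -->
<feed name="FeedOne" xmlns="uri:falcon:feed:0.1">
  <clusters>
    <cluster name="ClusterOne" type="source">
      <validity start="2016-01-01T00:00Z" end="2017-01-01T00:00Z"/>
    </cluster>
    <cluster name="ClusterTwo" type="target">
      <validity start="2016-01-01T00:00Z" end="2017-01-01T00:00Z"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/apps/falcon/feedOne/location/{YEAR}-{MONTH}-{DAY}-{HOUR}"/>
  </locations>
</feed>

<process name="ProcessOne" xmlns="uri:falcon:process:0.1">
  <clusters>
    <!-- Two clusters: each runs its own copy of every process instance -->
    <cluster name="ClusterTwo">
      <validity start="2016-01-01T00:00Z" end="2017-01-01T00:00Z"/>
    </cluster>
    <cluster name="ClusterThree">
      <validity start="2016-01-01T00:00Z" end="2017-01-01T00:00Z"/>
    </cluster>
  </clusters>
  <outputs>
    <!-- Both copies resolve this output to the same FeedOne path -->
    <output name="out" feed="FeedOne" instance="now(0,0)"/>
  </outputs>
</process>
```

Since both process clusters materialize the same output feed instance, both jobs target the same feed location, which is the collision described above.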

If the process is run on ClusterThree and the output is written to ClusterThree/apps/falcon/feedOne/location/{YEAR}-{MONTH}-{DAY}-{HOUR}
instead of the location in ClusterOne, I think it is a bug.

[~venkatnrangan] and [~sriksun] : What do you think?

> Process entity definition allows multiple clusters when it has output Feed defined. 
> ------------------------------------------------------------------------------------
>                 Key: FALCON-1728
>                 URL: https://issues.apache.org/jira/browse/FALCON-1728
>             Project: Falcon
>          Issue Type: Bug
>          Components: process
>    Affects Versions: 0.9
>            Reporter: Balu Vellanki
>            Assignee: Balu Vellanki
>            Priority: Critical
> Process XSD allows the user to specify multiple clusters per process entity. I am guessing
> this would allow a user to run duplicate instances of the process on multiple clusters at
> the same time (I do not really see a need for this). When the process has an output feed
> defined, you can have duplicate process instances writing to the same feed instance,
> causing data corruption/failures. The solution is to either
> 1. Not allow multiple clusters per process. Let the user define a duplicate process
> if the user wants to run duplicate instances.
> OR
> 2. Allow multiple clusters, but only when there is no output feed defined.
> [~sriksun] please let me know if there is any other reason for allowing multiple clusters
> in a process.
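
A minimal sketch of the check option 2 would imply, assuming we can count a process entity's clusters and output feeds. The class and method names here are hypothetical, not actual Falcon validation code:

```java
// Hypothetical sketch (assumed names, not actual Falcon code): reject a
// process entity that lists multiple clusters while also defining an
// output feed, per option 2 above.
public class ProcessEntityCheck {

    /** Throws if the cluster/output combination is unsafe; no-op otherwise. */
    public static void validate(int clusterCount, int outputFeedCount) {
        if (clusterCount > 1 && outputFeedCount > 0) {
            throw new IllegalArgumentException(
                "A process with an output feed may run on only one cluster");
        }
    }
}
```

Option 1 would be the same check with the `outputFeedCount > 0` condition dropped, i.e. rejecting every multi-cluster process outright.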

This message was sent by Atlassian JIRA
