falcon-dev mailing list archives

From Idris Ali <psychid...@gmail.com>
Subject Falcon late arrival and cut off
Date Thu, 25 Jun 2015 18:49:14 GMT
Hi Mahak,

To quickly answer your question:
Scenario 1 : A feed instance runs at 17:30 for replication but a file
ending in 1730 isn't available yet. So, the instance is rescheduled for a
later time and this keeps on happening until the file is found or the late
arrival cut off time (an hour in this case) is reached.
- Assuming it's a feed with frequency minutes(10), this scenario has
nothing to do with late data. Until the availability flag is found, the
17:30 replication instance stays in the "Waiting" state. Once the flag is
found, the instance moves to the "Running" state, replicates the data to the
target cluster, and the 17:30 instance is then considered a "Success".

Scenario 2: A feed instance runs at 17:30 for replication and finds that a
file ending in 1720 is now available which wasn't available when the last
replication instance ran(at 17:20). So, now it copies both the files (the
one ending in 1730 and the one ending in 1720).
- No, it won't copy data from both instances. Since the 17:20 data is
available for the first time, only 17:20's data is copied for that instance,
and the feed instance for 17:30 checks for data under the 17:30 directory
alone. The two are independent instances.
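
To illustrate the independence of instances, here is a minimal Python sketch. It is not Falcon code: the path template and both function names are made up for illustration, and a real feed defines its own location pattern.

```python
from datetime import datetime

# Hypothetical path template (an assumption, not taken from any real feed).
PATH_TEMPLATE = "/data/feed/{:%Y%m%d%H%M}"

def instance_path(instance_time: datetime) -> str:
    """Return the single directory that a given feed instance looks at."""
    return PATH_TEMPLATE.format(instance_time)

def paths_to_copy(instance_time: datetime, available: set) -> list:
    """An instance copies its own directory if present, never a neighbour's."""
    own = instance_path(instance_time)
    return [own] if own in available else []

# Both the 17:20 and 17:30 directories exist by the time 17:30 runs.
available = {
    instance_path(datetime(2015, 6, 25, 17, 20)),
    instance_path(datetime(2015, 6, 25, 17, 30)),
}
print(paths_to_copy(datetime(2015, 6, 25, 17, 30), available))
```

The 17:30 instance reports only its own directory even though the 17:20 directory is also present.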


Late arrival works for both Feed and Process entities, and the details of
the functionality are available in the Falcon documentation.
Please check the "Late Data" section of
http://falcon.apache.org/0.6-incubating/EntitySpecification.html#Feed_Specification


Since your question is related to Feed replication (late data), I will try
to answer here:
1. From the Feed definition, let's say we have

<frequency>hours(1)</frequency>
<late-arrival cut-off="hours(6)"/>

2. From Falcon runtime.properties
A feed cut-off policy is required for late-data handling for Feeds.
Allowed policies: periodic, exp-backoff (exponential backoff) and final.
Ex: periodic with delay=hours(2)

Here, Falcon would replicate the feed once every hour: 17:00, 18:00 and so
on.
late-arrival specifies *for how long this feed should be checked for late
data changes in the source cluster*. In this case, 6 hours.
So, for the 17:00 instance, it is honoured until 23:00 (17 + 6); for the
18:00 instance, until 00:00 (next day); and so on.
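
The window arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation, not Falcon code; the function name is made up, and the date is taken from this thread.

```python
from datetime import datetime, timedelta

def late_window_end(instance_time: datetime, cut_off: timedelta) -> datetime:
    """Last moment at which late data for this instance is still honoured."""
    return instance_time + cut_off

cut_off = timedelta(hours=6)  # corresponds to cut-off="hours(6)"
for hour in (17, 18):
    start = datetime(2015, 6, 25, hour, 0)
    print(start, "->", late_window_end(start, cut_off))
# 2015-06-25 17:00:00 -> 2015-06-25 23:00:00
# 2015-06-25 18:00:00 -> 2015-06-26 00:00:00
```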

*When to check?* is specified by the cut-off policy. Here it says periodic,
hours(2), so Falcon checks the source cluster input for changes every 2
hours.
So, Falcon would check the 17:00 instance at 19:00 for the data in the
source cluster, then at 21:00 and finally at 23:00.
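
A small Python sketch of the periodic schedule, under the assumption that a check landing exactly on the cut-off (23:00 here) still runs, as in the example above. The function name is hypothetical and this is not how Falcon itself is implemented.

```python
from datetime import datetime, timedelta

def periodic_check_times(instance_time, cut_off, delay):
    """List the late-data check times for one instance under a periodic policy."""
    window_end = instance_time + cut_off
    checks = []
    t = instance_time + delay
    while t <= window_end:  # assumption: the boundary check is included
        checks.append(t)
        t += delay
    return checks

checks = periodic_check_times(
    datetime(2015, 6, 25, 17, 0),
    cut_off=timedelta(hours=6),
    delay=timedelta(hours=2),
)
print([c.strftime("%H:%M") for c in checks])  # ['19:00', '21:00', '23:00']
```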

*How are changes detected?* Falcon maintains the data size for every
instance run, so it records the size of the data at the first run (17:00).
If it detects a different size in the source input at the next periodic
check (19:00), it simply reruns the entire replication, *overwriting* the
previously replicated data.
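
The size comparison can be sketched roughly as follows in Python. How Falcon actually stores and reads the sizes is not covered here; plain integers and a callback stand in for that, and all names are made up for illustration.

```python
def needs_rerun(recorded_size: int, current_size: int) -> bool:
    """Late data is assumed whenever the source size differs from the record."""
    return current_size != recorded_size

def late_check(recorded_size: int, current_size: int, replicate) -> int:
    """Rerun the whole replication on a size change; return the size to record."""
    if needs_rerun(recorded_size, current_size):
        replicate()  # full rerun, overwriting the earlier copy
        return current_size
    return recorded_size

reruns = []
size = 1000                                                        # recorded at the 17:00 run
size = late_check(size, 1000, lambda: reruns.append("19:00"))      # unchanged, no rerun
size = late_check(size, 1500, lambda: reruns.append("21:00"))      # changed, rerun
print(reruns, size)
```

Note that a pure size comparison treats any difference, larger or smaller, as a change.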



Hope this answers your question.

Thanks,
-Idris

On Thu, Jun 25, 2015 at 10:02 PM, Mahak Mukhi <mmukhi@yahoo-inc.com.invalid> wrote:

> Hi,
> I wanted to get a clearer picture on how does falcon handle late arrivals?
> Does it wait for the specific feed instance for cut off time before failing
> or would it look for all files in the time interval (current - cut off) to
> (current). Consider the following 2 scenarios, I'd like to know which one
> corresponds with falcon:
> There's a feed set up for replication with a frequency of 10 minutes and
> the late arrival cut off time is set to be an hour.
> Scenario 1 : A feed instance runs at 17:30 for replication but a file
> ending in 1730 isn't available yet. So, the instance is rescheduled for a
> later time and this keeps on happening until the file is found or the late
> arrival cut off time (an hour in this case) is reached. In latter case, the
> replication job fails.
> Scenario 2: A feed instance runs at 17:30 for replication and finds that a
> file ending in 1720 is now available which wasn't available when the last
> replication instance ran(at 17:20). So, now it copies both the files (the
> one ending in 1730 and the one ending in 1720).
>
> I'm inclined to believe that scenario 1 corresponds with Falcon, however I
> want to confirm that I'm not missing anything. In case it is Scenario 2,
> how does falcon keep track of what files have been copied?
> Your help is much appreciated. Thanks.
>  Regards,
> Mahak Mukhi
>
