falcon-dev mailing list archives

From "Srikanth Sundarrajan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FALCON-267) Add CDC feature
Date Thu, 30 Jan 2014 02:27:06 GMT

    [ https://issues.apache.org/jira/browse/FALCON-267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886193#comment-13886193 ]

Srikanth Sundarrajan commented on FALCON-267:

This feature already exists in a crude form in Falcon today. The feed definition has a tag
called "late-cut-off", which sets the time limit within which changes to the feed are monitored.
When a process has a late input (which means the feed changed), the process is re-executed. I had
proposed the idea of creating recipes on top of the Falcon system to achieve some common data
management objectives, and this seems a nice fit. I will write down my thoughts and share them
on the dev list.
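
To illustrate the late-input behaviour described above, here is a minimal Python sketch. It is
not Falcon's actual implementation. The function name should_rerun and its parameters are
hypothetical, and it assumes that a change counts as late input when it is seen after the
instance's nominal time and no later than the cut-off.

```python
from datetime import datetime, timedelta


def should_rerun(instance_time, cut_off, change_detected_at):
    """Decide whether a process instance should be re-executed.

    Hypothetical helper: a change is treated as late input only if it is
    detected after the instance's nominal time and within the cut-off window.
    """
    window_end = instance_time + cut_off
    return instance_time < change_detected_at <= window_end


nominal = datetime(2014, 1, 30, 0, 0)
print(should_rerun(nominal, timedelta(hours=6), nominal + timedelta(hours=3)))
print(should_rerun(nominal, timedelta(hours=6), nominal + timedelta(hours=7)))
```

The first call falls inside the six-hour window and the second falls outside it, so only the
first would lead to a re-execution under these assumptions.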

> Add CDC feature
> ---------------
>                 Key: FALCON-267
>                 URL: https://issues.apache.org/jira/browse/FALCON-267
>             Project: Falcon
>          Issue Type: New Feature
>            Reporter: Jean-Baptiste Onofré
>            Assignee: Jean-Baptiste Onofré
> I propose to add a Change Data Capture feature in Falcon.
> The idea is to be able to detect changes, initially on HDFS files, and publish the identified
> gap to a messaging topic.
> Here is what I would like to PoC:
> - in a feed definition, we add a <capture/> element defining the change check interval.
> - we create a coordinator in Oozie which executes the following workflow at each capture
>   interval.
> - in the Falcon staging "capture" location on HDFS, we keep the first state of the feed.
>   We compare (diff) the current content with the staging location, and write the diff to the
>   Falcon staging. If the file is binary, we can detect a change (using MD5, for instance) and
>   the diff is the complete file (as in svn, git, etc.).
> - if we have a diff, we publish a message to the Falcon "capture" topic. The message carries
>   a set of JMS properties, and its body contains the link to the diff (on HDFS, in the
>   Falcon staging). The "stream" copy is overwritten by the new one.
> The purpose of this CDC is:
> 1/ thanks to the publication on the topic, to be able to use "external" tools to "react"
>    when a change occurs. For instance, I plan to make a demo with an Apache Camel route
>    (sending e-mails, for example) when data changes.
> 2/ staying within Falcon/Oozie/Hadoop, to be able to set up a pipeline triggered by data
>    change: for instance, trigger a job when the data changes.
> The first PoC is HDFS/fs centric, but I think we can also do the diff on HBase or Hive.
> Thoughts ?
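
The capture workflow proposed in the issue can be sketched roughly as follows. This is a
minimal illustration under stated assumptions, not the proposed Falcon/Oozie implementation.
A local directory stands in for the Falcon staging location on HDFS, a plain Python list
stands in for the JMS "capture" topic, and the names capture, md5_of and is_binary, the
".snapshot"/".diff" suffixes and the "feed" property are all hypothetical.

```python
import difflib
import hashlib
import shutil
from pathlib import Path


def md5_of(path):
    """Return the MD5 hex digest of a file's content."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def is_binary(data):
    """Crude heuristic (an assumption): NUL bytes mean binary content."""
    return b"\x00" in data


def capture(feed_file, staging_dir, topic):
    """Run one capture cycle; return the published message, or None."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    snapshot = staging_dir / (feed_file.name + ".snapshot")

    # First run: keep the first state of the feed in staging.
    if not snapshot.exists():
        shutil.copyfile(feed_file, snapshot)
        return None

    # Cheap change detection through MD5, for text and binary alike.
    if md5_of(feed_file) == md5_of(snapshot):
        return None

    old, new = snapshot.read_bytes(), feed_file.read_bytes()
    diff_path = staging_dir / (feed_file.name + ".diff")
    if is_binary(old) or is_binary(new):
        # For binary content the "diff" is the complete new file.
        diff_path.write_bytes(new)
    else:
        diff = difflib.unified_diff(
            old.decode("utf-8", errors="replace").splitlines(keepends=True),
            new.decode("utf-8", errors="replace").splitlines(keepends=True),
            fromfile="staged",
            tofile="current",
        )
        diff_path.write_text("".join(diff))

    # Overwrite the staged copy with the new state.
    shutil.copyfile(feed_file, snapshot)

    # Publish: properties plus a body that links to the diff.
    message = {"properties": {"feed": feed_file.name}, "body": str(diff_path)}
    topic.append(message)
    return message
```

In the real proposal the coordinator would invoke such a step at each capture interval, and a
consumer of the topic (an Apache Camel route, for example) would read the link in the message
body to fetch the diff.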

This message was sent by Atlassian JIRA
