flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Seth Wiesman <swies...@mediamath.com>
Subject [Proposal] RichFsSinkFunction
Date Tue, 24 Jan 2017 17:06:28 GMT
Hi all,

I have been working with Flink for a while at work now and in that time I have developed several
extensions that I would like to contribute back. I wanted to reach out with what has been
the most significant modification for and see if it is something that the community would
be interested in.


Currently my pipeline uses the BucketingSink to write out files to S3 which are then consumed
by other processes. Outputting data to a file system is fundamentally different than outputting
to something such as Kafka because data can only be consumed once at the file level. The BucketingSink
moves files through three phases, inprogress, pending, and complete, and if you are interested
in maintaining exactly once guarantees then you only want your external services to consume
files once it reaches a complete state. One option is to write a _SUCCESS file to a bucket
once all files in that bucket are done but that can be difficult to coordinate or may take
a prohibitively long amount of time. In the case of the BasePathBucketer this will never happen.
For my use case it is important to for external services to be able to consume files as soon
as they become available. To solve this, we modified the bucketing sink to not be the end
of the pipeline but instead forward the final path of files on once they reach their final
state to a final operator.

I do not want to fundamentally change the concept of what a sink is, it should remain the
end of the pipeline, instead this is simply to allow a custom ‘onClose’ step. To do this,
paths can only be forwarded on to an operator of parallelism one whose only operation is to
add a sink. From this other services can be notified that completed files exist.

To provide a motivating example, after writing files to S3 I need to load them into a redshift
cluster. To do this I batch completed files for 1 minute of processing time and then write
out a manifest file ( a list of completed files to load) and run a copy command. http://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html

I have below an example gist of what this looks like to use.

Example: https://gist.github.com/sjwiesman/fc99c64f44a93cfc9c7aa62c070a9358


Seth Wiesman | Data Engineer

4 World Trade Center, 45th Floor, New York, NY 10007<https://www.mediamath.com/mailto>

  • Unnamed multipart/related (inline, None, 0 bytes)
View raw message