flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kostas Kloudas <k.klou...@data-artisans.com>
Subject Re: bad data output
Date Tue, 03 Apr 2018 09:58:31 GMT
Hi Darshan,

You can use side outputs [1] and a process function to split the data in as many streams as
you want,
e.g. correct, fixable and wrong. Each side output will be a separate stream that your can
process individually.

You can always send the “bad data” directly from your process function to Kafka or wherever.
You just need
to override the open() method, create a connection to the outside storage system, and use
that connection
to store the data whenever you see them. Keep in mind though, that your process function is
executed on a single thread, so it may be beneficial to split your computation in multiple
functions (although this is up to
you to benchmark and see if it fits you).

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/side_output.html

> On Mar 29, 2018, at 8:53 PM, Darshan Singh <darshan.meel@gmail.com> wrote:
> Hi
> I have a dataset which has almost 99% of correct data. As of now if say some data is
bad I just ignore it and log it and return only correct data. I do this inside a map function.
> The part which decides whether data is correct or not is expensive one.
> Now I want to store the bad data somewhere so that I could analyze that data in future.
> So I can run the same calc 2 times and get the correct data in first go and bad data
in 2nd go.
> Is there a better way where I can somehow store the bad data from inside of map function
like send to kafka, file etc?
> Also, is there a way I could create a datastream which can get the data from inside map
function(not sure this is feasible as of now)?
> Thanks

View raw message