flink-user mailing list archives

From Leith Mudge <lei...@palamir.com>
Subject Re: Using Kafka and Flink for batch processing of a batch data source
Date Wed, 20 Jul 2016 23:57:34 GMT
Thanks Milind & Till,

This is what I thought from my reading of the documentation but it is nice to have it confirmed
by people more knowledgeable.

Supplementary to this question: is Flink the best choice for batch processing at this point in time, or would I be better off looking at a more mature, dedicated batch processing engine such as Spark? I do like the unified programming model outlined in Apache Beam/the Google Cloud Dataflow SDK, which purports to have runners for both Flink and Spark.

Regards,

Leith
From: Till Rohrmann <trohrmann@apache.org>
Date: Wednesday, 20 July 2016 at 5:05 PM
To: <user@flink.apache.org>
Subject: Re: Using Kafka and Flink for batch processing of a batch data source

At the moment there is also no batch source for Kafka. I'm also not so sure how you would
define a batch given a Kafka stream. Only reading till a certain offset? Or maybe until one
has read n messages?
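One way to make the "read till a certain offset" idea concrete is to snapshot the end offsets when the batch is defined and keep only records below that snapshot. The sketch below models this on plain Python data (no broker involved); the record layout and the `bounded_batch` helper are illustrative, not part of any Kafka or Flink API.

```python
def bounded_batch(records, end_offsets):
    """Keep only records whose offset is below the per-partition snapshot
    taken at batch start; anything produced afterwards belongs to the
    next batch."""
    return [r for r in records if r["offset"] < end_offsets[r["partition"]]]

stream = [
    {"partition": 0, "offset": 0, "value": "a"},
    {"partition": 0, "offset": 1, "value": "b"},
    {"partition": 1, "offset": 0, "value": "c"},
    {"partition": 0, "offset": 2, "value": "late"},  # arrives after the snapshot
]
snapshot = {0: 2, 1: 1}  # end offsets captured when the batch was defined

batch = bounded_batch(stream, snapshot)
# only "a", "b", "c" are in the batch; "late" waits for the next one
```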

I think it's best to write the batch data to HDFS or another batch data store.

Cheers,
Till

On Wed, Jul 20, 2016 at 8:08 AM, milind parikh <milindsparikh@gmail.com> wrote:

It likely does not make sense to publish a file ("batch data") into Kafka unless the file is very small.

An improvised pub-sub mechanism around Kafka could be to (a) write the file to a persistent store outside of Kafka and (b) publish a message to Kafka about that write, so as to enable processing of that file.
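The notification in step (b) only needs to carry enough metadata for a consumer to locate and verify the file. A minimal sketch, assuming a JSON payload and a hypothetical topic name; the `file_notification` helper and its fields are illustrative, not from any Kafka client API:

```python
import hashlib
import json

def file_notification(path, data):
    """Build the small Kafka message announcing a file written to external
    storage: path to locate it, size and checksum to verify it."""
    return json.dumps({
        "path": path,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    })

msg = file_notification("/data/batch-2016-07-20.csv", b"id,value\n1,42\n")
# A real producer would then publish it, e.g. (hypothetical):
# producer.send("file-events", msg.encode("utf-8"))
```

Downstream, a Flink job (or any consumer) subscribed to the notification topic reads the message, fetches the file from the store, and processes it.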

If you really need provenance around the processing, you could route the data through Apache NiFi before Flink.

Regards
Milind

On Jul 19, 2016 9:37 PM, "Leith Mudge" <leithm@palamir.com> wrote:

I am currently working on an architecture for a big data streaming and batch processing platform. I am planning on using Apache Kafka as the distributed messaging system to handle data from streaming data sources and then pass it on to Apache Flink for stream processing. I would also like to use Flink's batch processing capabilities to process batch data.

Does it make sense to pass the batched data through Kafka on a periodic basis as a source for Flink batch processing (is this even possible?), or should I just write the batch data to a data store and then process it by reading it into Flink?

________________________________

| All rights in this email and any attached documents or files are expressly reserved. This
e-mail, and any files transmitted with it, contains confidential information which may be
subject to legal privilege. If you are not the intended recipient, please delete it and notify
Palamir Pty Ltd by e-mail. Palamir Pty Ltd does not warrant this transmission or attachments
are free from viruses or similar malicious code and does not accept liability for any consequences
to the recipient caused by opening or using this e-mail. For the legal protection of our business,
any email sent or received by us may be monitored or intercepted. | Please consider the environment
before printing this email. |

