spark-user mailing list archives

From: Bill Jay <bill.jaypeter...@gmail.com>
Subject: Spark Streaming + Kafka failure recovery
Date: Tue, 19 May 2015 17:42:15 GMT
Hi all,

I am currently using Spark Streaming to consume and save logs every hour in
our production pipeline. The current setup is a crontab job that checks every
minute whether the streaming job is still running and, if not, resubmits it.
I am using the direct approach for the Kafka consumer.
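For reference, the consumer is created roughly like the sketch below; the
broker addresses, topic name, batch interval, and output path are placeholders
rather than our real values:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object LogConsumer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hourly-log-consumer")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Direct approach: the brokers are queried directly, there is no receiver,
    // no consumer group id, and no offsets are written to ZooKeeper.
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("app-logs")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Save the message payloads; the real job rolls them up per hour.
    stream.map(_._2).saveAsTextFiles("hdfs:///logs/raw")

    ssc.start()
    ssc.awaitTermination()
  }
}

I have two questions: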

1. In the direct approach, no offsets are stored in ZooKeeper and no group id
is specified. Can two consumers (one being Spark Streaming and the other a
Kafka console consumer from the Kafka package) read the same topic from the
brokers at the same time? I would like both of them to receive all messages,
i.e. publish-subscribe semantics. What about two Spark Streaming jobs reading
from the same topic?

2. How can I avoid data loss if a Spark job is killed? Does checkpointing
serve this purpose? The default behavior of Spark Streaming is to start from
the latest offsets. If a job is killed, can the new job resume from where the
old one left off so that no logs are lost? A sketch of what I have in mind is
below the questions.
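
For question 2, this is roughly what I am considering: creating the
StreamingContext through StreamingContext.getOrCreate with a checkpoint
directory, so that a restarted driver rebuilds the context (including the
Kafka offset ranges of the direct stream) from the checkpoint instead of
starting from the latest offsets. The checkpoint path and the stream setup
below are placeholders matching the sketch above; I would like to confirm
this actually prevents losing logs:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object RecoverableLogConsumer {
  // Placeholder checkpoint directory on HDFS.
  val checkpointDir = "hdfs:///checkpoints/log-consumer"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("hourly-log-consumer")
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("app-logs"))
    stream.map(_._2).saveAsTextFiles("hdfs:///logs/raw")
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a clean start this calls createContext(); after a kill/restart it
    // rebuilds the context and its offset ranges from the checkpoint data.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}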

Thanks!

Bill
