spark-user mailing list archives

From francois.garil...@typesafe.com
Subject Re: KafkaUtils explicit acks
Date Wed, 10 Dec 2014 10:42:55 GMT
[sorry for the botched half-message]

Hi Mukesh,

There’s been some great work on Spark Streaming reliability lately.

https://www.youtube.com/watch?v=jcJq3ZalXD8

Look at the links from:

https://issues.apache.org/jira/browse/SPARK-3129

I’m not aware of any docs yet (did I miss something?), but you can look at the ReliableKafkaReceiver’s
test suite:

external/kafka/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala
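
In case it helps, here is a minimal sketch of how I understand the wiring, assuming a build that includes the SPARK-3129 write-ahead log work. The ZooKeeper address, consumer group, topic name and checkpoint path are placeholders I made up, so treat this as an illustration rather than a tested recipe:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReliableKafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("reliable-kafka-sketch")
      // Enables the receiver write-ahead log (the SPARK-3129 work).
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(2))

    // The write-ahead log lives under the checkpoint directory, so it should
    // point at a fault-tolerant filesystem. Placeholder path (assumption).
    ssc.checkpoint("hdfs:///tmp/spark-checkpoints")

    // Placeholder ZooKeeper quorum, consumer group and topic (assumptions).
    val messages = KafkaUtils.createStream(
      ssc,
      "zk-host:2181",
      "my-consumer-group",
      Map("my-topic" -> 1),            // topic -> number of consumer threads
      StorageLevel.MEMORY_AND_DISK_SER // the log already provides durability
    )

    // createStream yields (key, message) pairs; keep only the message.
    messages.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

As far as I can tell, with that flag set KafkaUtils.createStream uses the ReliableKafkaReceiver, which commits offsets to ZooKeeper only after the received block has been stored. That gives at-least-once delivery, so the write to HBase should tolerate replayed records.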


—
FG

On Wed, Dec 10, 2014 at 11:17 AM, Mukesh Jha <me.mukesh.jha@gmail.com>
wrote:

> Hello Guys,
> Any insights on this?
> If I'm not clear enough, my question is: how can I use a Kafka consumer and not
> lose any data in case of failures with spark-streaming?
> On Tue, Dec 9, 2014 at 2:53 PM, Mukesh Jha <me.mukesh.jha@gmail.com> wrote:
>> Hello Experts,
>>
>> I'm working on a spark app which reads data from kafka & persists it in
>> hbase.
>>
>> The Spark documentation states below *[1]* that in case of a worker failure
>> we can lose some data. If so, how can I make my Kafka stream more reliable?
>> I have seen there is a simple consumer *[2]* but I'm not sure if it has
>> been used/tested extensively.
>>
>> I was wondering if there is a way to explicitly acknowledge the kafka
>> offsets once they are replicated in memory of other worker nodes (if it's
>> not already done) to tackle this issue.
>>
>> Any help is appreciated in advance.
>>
>>
>>    1. *Using any input source that receives data through a network* - For
>>    network-based data sources like *Kafka *and Flume, the received input
>>    data is replicated in memory between nodes of the cluster (default
>>    replication factor is 2). So if a worker node fails, then the system can
>>    recompute the lost data from the leftover copy of the input data. However,
>>    if the *worker node where a network receiver was running fails, then a
>>    tiny bit of data may be lost*, that is, the data received by the
>>    system but not yet replicated to other node(s). The receiver will be
>>    started on a different node and it will continue to receive data.
>>    2. https://github.com/dibbhatt/kafka-spark-consumer
>>
>> Thanks,
>>
>> *Mukesh Jha <me.mukesh.jha@gmail.com>*
>>
> -- 
> Thanks & Regards,
> *Mukesh Jha <me.mukesh.jha@gmail.com>*