spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James King <jakwebin...@gmail.com>
Subject Re: Reading Real Time Data only from Kafka
Date Tue, 12 May 2015 14:20:28 GMT
Many thanks both, appreciate the help.

On Tue, May 12, 2015 at 4:18 PM, Cody Koeninger <cody@koeninger.org> wrote:

> Yes, that's what happens by default.
>
> If you want to be super accurate about it, you can also specify the exact
> starting offsets for every topic/partition.
>
> On Tue, May 12, 2015 at 9:01 AM, James King <jakwebinbox@gmail.com> wrote:
>
>> Thanks Cody.
>>
>> Here are the events:
>>
>> - Spark app connects to Kafka first time and starts consuming
>> - Messages 1 - 10 arrive at Kafka then Spark app gets them
>> - Now driver dies
>> - Messages 11 - 15 arrive at Kafka
>> - Spark driver program reconnects
>> - Then Messages 16 - 20 arrive Kafka
>>
>> What I want is that Spark ignores 11 - 15
>> but should process 16 - 20 since they arrived after the driver
>> reconnected to Kafka
>>
>> Is this what happens by default in your suggestion?
>>
>>
>>
>>
>>
>> On Tue, May 12, 2015 at 3:52 PM, Cody Koeninger <cody@koeninger.org>
>> wrote:
>>
>>> I don't think it's accurate for Akhil to claim that the linked library
>>> is "much more flexible/reliable" than what's available in Spark at this
>>> point.
>>>
>>> James, what you're describing is the default behavior for the
>>> createDirectStream api available as part of spark since 1.3.  The kafka
>>> parameter auto.offset.reset defaults to largest, ie start at the most
>>> recent available message.
>>>
>>> This is described at
>>> http://spark.apache.org/docs/latest/streaming-kafka-integration.html
>>>  The createDirectStream api implementation is described in detail at
>>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>>
>>> If for some reason you're stuck using an earlier version of spark, you
>>> can accomplish what you want simply by starting the job using a new
>>> consumer group (there will be no prior state in zookeeper, so it will start
>>> consuming according to auto.offset.reset)
>>>
>>> On Tue, May 12, 2015 at 7:26 AM, James King <jakwebinbox@gmail.com>
>>> wrote:
>>>
>>>> Very nice! will try and let you know, thanks.
>>>>
>>>> On Tue, May 12, 2015 at 2:25 PM, Akhil Das <akhil@sigmoidanalytics.com>
>>>> wrote:
>>>>
>>>>> Yep, you can try this lowlevel Kafka receiver
>>>>> https://github.com/dibbhatt/kafka-spark-consumer. Its much more
>>>>> flexible/reliable than the one comes with Spark.
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Tue, May 12, 2015 at 5:15 PM, James King <jakwebinbox@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> What I want is if the driver dies for some reason and it is restarted
>>>>>> I want to read only messages that arrived into Kafka following the
restart
>>>>>> of the driver program and re-connection to Kafka.
>>>>>>
>>>>>> Has anyone done this? any links or resources that can help explain
>>>>>> this?
>>>>>>
>>>>>> Regards
>>>>>> jk
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message