spark-user mailing list archives

From Dominik Safaric <dominiksafa...@gmail.com>
Subject Re: Spark Streaming 2.1 Kafka consumer - retrieving offset commits for each poll
Date Wed, 26 Apr 2017 18:17:35 GMT
The reason why I want to obtain these <partition, offset, timestamp> tuples is to relate the consumption rate to the production rate using Kafka's internal __consumer_offsets topic. Interestingly, Spark's KafkaConsumer implementation does not auto-commit offsets when the auto-commit interval expires, because, as seen in the logs, Spark overrides the enable.auto.commit property to false.

Any idea on how to use the KafkaConsumer's auto offset commits? Keep in mind that I do not care about exactly-once semantics, so having messages replayed is perfectly fine.
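For what it's worth, since the direct stream forces enable.auto.commit to false, the documented alternative is to commit offsets back to Kafka yourself via the CanCommitOffsets API. A minimal sketch (the stream variable and println sink are illustrative; a timestamp is not part of an offset range, so it is recorded at batch time here):

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// `stream` is assumed to be a DStream created with KafkaUtils.createDirectStream.
stream.foreachRDD { rdd =>
  // Each RDD produced by the direct stream carries the offset ranges of its poll.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // The poll timestamp is not stored in the range; capture it at batch time.
  val polledAt = System.currentTimeMillis()

  offsetRanges.foreach { o =>
    println(s"$polledAt ${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }

  // Asynchronously commit the consumed ranges back to Kafka
  // (at-least-once: replays are possible on failure, which is fine here).
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```

The commit lands in __consumer_offsets under the stream's group.id, which should make the production/consumption comparison possible.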

> On 26 Apr 2017, at 19:26, Cody Koeninger <cody@koeninger.org> wrote:
> 
> What is it you're actually trying to accomplish?
> 
> You can get topic, partition, and offset bounds from an offset range like
> 
> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets
> 
> Timestamp isn't really a meaningful idea for a range of offsets.
> 
> 
> On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric
> <dominiksafaric@gmail.com> wrote:
>> Hi all,
>> 
>> Because the Spark Streaming direct Kafka consumer maps offsets for a given
>> Kafka topic and partition internally while having enable.auto.commit set
>> to false, how can I retrieve the offsets of each of the consumer’s poll calls
>> using the offset ranges of an RDD? More precisely, the information I seek
>> after each poll call is the following: <timestamp, offset, partition>.
>> 
>> Thanks in advance,
>> Dominik
>> 


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

