spark-issues mailing list archives

From "Cody Koeninger (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition
Date Wed, 12 Apr 2017 16:38:41 GMT

    [ https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966180#comment-15966180 ]

Cody Koeninger commented on SPARK-20287:
----------------------------------------

The issue here is that the underlying new Kafka consumer API doesn't give a single
consumer a way to subscribe to multiple partitions while reading only a particular
range of messages from one of them.
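
As an illustration, here is a minimal sketch of that per-partition pattern: one consumer assigned to a single TopicPartition, seeking to a starting offset and polling up to an ending offset. It assumes the kafka-clients 0.10.x API; the method and variable names are placeholders, not Spark's actual code.

{code:scala}
import java.{util => ju}
import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
import org.apache.kafka.common.TopicPartition

// Sketch only: read records for one partition in the range [fromOffset, untilOffset),
// the kind of bounded read a direct-stream batch needs per partition.
def readRange(consumer: KafkaConsumer[String, String],
              tp: TopicPartition,
              fromOffset: Long,
              untilOffset: Long): Vector[ConsumerRecord[String, String]] = {
  consumer.assign(ju.Collections.singletonList(tp))  // manual assignment, no group rebalancing
  consumer.seek(tp, fromOffset)                      // start exactly at the batch's first offset
  var buf = Vector.empty[ConsumerRecord[String, String]]
  var nextOffset = fromOffset
  while (nextOffset < untilOffset) {
    val polled = consumer.poll(512L).records(tp)
    if (polled.isEmpty) return buf                   // nothing available; a real task would retry or fail
    val it = polled.iterator()
    while (it.hasNext) {
      val r = it.next()
      if (r.offset < untilOffset) { buf :+= r; nextOffset = r.offset + 1 }
      else nextOffset = untilOffset                  // stop once we reach the batch's end offset
    }
  }
  buf
}
{code}

The point is that assign() plus seek() only makes sense one partition at a time when each batch needs a precise offset range, which is why the cache ends up holding one consumer per partition.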

The max capacity is just a simple way of dealing with what is basically an LRU cache: if someone
creates topics dynamically and then stops sending messages to them, you don't want to keep
leaking resources.
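
For what that cap amounts to, here is a rough sketch of a capacity-bounded LRU map. It is not Spark's actual cache code, just the idea behind spark.streaming.kafka.consumer.cache.maxCapacity, with the class name and eviction behaviour made up for illustration.

{code:scala}
import java.{util => ju}

// Illustration only: an access-ordered LinkedHashMap evicts the least recently
// used entry once the configured capacity is exceeded, closing the idle consumer
// so its broker connection isn't leaked.
class LruConsumerCache[K, V <: AutoCloseable](maxCapacity: Int)
    extends ju.LinkedHashMap[K, V](16, 0.75f, true) {
  override def removeEldestEntry(eldest: ju.Map.Entry[K, V]): Boolean = {
    if (size() > maxCapacity) {
      eldest.getValue.close()   // release the evicted consumer's connection
      true
    } else false
  }
}
{code}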

I'm not claiming there's anything great or elegant about those solutions, but they were pretty
much the most straightforward way to make the direct stream model work with the new Kafka
consumer API.

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-20287
>                 URL: https://issues.apache.org/jira/browse/SPARK-20287
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Stephane Maarek
>
> As I understand it, and as it stands, one Kafka consumer is created for each topic partition
> in the source Kafka topics, and they're cached.
> cf https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of the Kafka consumers, namely that Kafka automatically
> assigns and balances partitions amongst all the consumers that share the same group.id
> - You can still cache your Kafka consumer even if it has multiple partitions.
> I'm not sure how that translates to Spark's underlying RDD architecture, but
> from a Kafka standpoint, I believe creating one consumer per partition is a big overhead,
> and a risk, as the user may have to increase the spark.streaming.kafka.consumer.cache.maxCapacity
> parameter.
> Happy to discuss to understand the rationale.
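
For contrast with the assign()-based pattern above, this is the group-managed subscription the reporter describes in the second bullet, where Kafka itself balances partitions across consumers sharing a group.id. Broker address, group id and topic name below are placeholders, and this is not how the Spark source currently works; it is only the Kafka-side behaviour being pointed to.

{code:scala}
import java.{util => ju}
import org.apache.kafka.clients.consumer.KafkaConsumer

// Sketch of group-managed subscription: every consumer created with the same
// group.id and subscribe()d to the topic gets a balanced share of its partitions.
val props = new ju.Properties()
props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
props.put("group.id", "my-streaming-app")          // consumers sharing this id split the partitions
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(ju.Collections.singletonList("my-topic"))  // Kafka assigns and rebalances partitions
val records = consumer.poll(1000L)                            // one consumer can read many partitions
{code}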



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

