spark-issues mailing list archives

From "Tathagata Das (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-23541) Allow Kafka source to read data with greater parallelism than the number of topic-partitions
Date Thu, 01 Mar 2018 01:28:00 GMT
Tathagata Das created SPARK-23541:
-------------------------------------

             Summary: Allow Kafka source to read data with greater parallelism than the number of topic-partitions
                 Key: SPARK-23541
                 URL: https://issues.apache.org/jira/browse/SPARK-23541
             Project: Spark
          Issue Type: New Feature
          Components: Structured Streaming
    Affects Versions: 2.3.0
            Reporter: Tathagata Das
            Assignee: Tathagata Das


Currently, when the Kafka source reads from Kafka, it generates as many tasks as there are
partitions in the topic(s) being read. In some cases, it may be beneficial to read the data
with greater parallelism, that is, with a larger number of partitions/tasks. To do that, the offset
ranges must be divided into smaller ranges such that the number of records per partition ~= total
records in batch / desired partitions. This would also balance out any data skew between topic-partitions.
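
To make the proposed division concrete, here is a minimal sketch in plain Python. It is not Spark's implementation: the names OffsetRange, split_ranges and desired_partitions are hypothetical and chosen only for illustration. It assumes each topic-partition contributes a half-open offset range [start, end) to the batch, and that every offset in the range corresponds to one record.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical offset range for one topic-partition: [start, end)."""
    topic: str
    partition: int
    start: int  # inclusive
    end: int    # exclusive

    @property
    def size(self) -> int:
        return self.end - self.start


def split_ranges(ranges: List[OffsetRange], desired_partitions: int) -> List[OffsetRange]:
    """Split per-topic-partition ranges into roughly equal-sized pieces.

    Target size per piece ~= total records in batch / desired partitions.
    The number of pieces returned can differ slightly from
    desired_partitions because each range is rounded independently.
    """
    total = sum(r.size for r in ranges)
    if total == 0 or desired_partitions <= 0:
        return list(ranges)  # nothing to split

    target = total / desired_partitions  # records per task
    pieces: List[OffsetRange] = []
    for r in ranges:
        if r.size == 0:
            continue  # no records to read from this topic-partition
        # Larger ranges get more pieces; every non-empty range gets at least one.
        parts = min(r.size, max(1, round(r.size / target)))
        for i in range(parts):
            lo = r.start + (r.size * i) // parts
            hi = r.start + (r.size * (i + 1)) // parts
            pieces.append(OffsetRange(r.topic, r.partition, lo, hi))
    return pieces


# A skewed batch: one topic-partition holds 900 records, the other 100.
batch = [OffsetRange("events", 0, 0, 900), OffsetRange("events", 1, 0, 100)]
tasks = split_ranges(batch, desired_partitions=10)
```

In this sketch the skewed topic-partition is cut into more pieces than the small one, so each task reads a similar number of records, which is the balancing effect described above.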



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


