spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Kafka DStream Parallelism
Date Sat, 28 Feb 2015 02:56:00 GMT
The coarsest level at which you can parallelize is the topic. Topics are
essentially independent of one another, so they can be consumed
independently. But you can also parallelize within a single topic.

A Kafka group ID defines a consumer group. Each message published to a
topic is delivered to exactly one consumer in a group listening to that
topic. Topics can also be split into partitions. You can thus create N
consumers in a group for a topic with N partitions, and each consumer
will effectively read from one partition.

Yes, my understanding is that multiple receivers in one group are the
way to consume a topic's partitions in parallel.
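A minimal sketch of that pattern with the Spark 1.2 receiver-based API: create one DStream per partition, all in the same consumer group, then union them. The topic name, group ID, ZooKeeper quorum, and partition count below are placeholder assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-parallel")
val ssc = new StreamingContext(conf, Seconds(10))

// One receiver per partition, all sharing the same groupId, so Kafka
// balances the topic's partitions across the receivers.
val numPartitions = 3  // assumed partition count for the example topic
val streams = (1 to numPartitions).map { _ =>
  KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
}

// Union the per-receiver streams into a single DStream for processing.
val unified = ssc.union(streams)
unified.count().print()

ssc.start()
ssc.awaitTermination()
```

This is the approach [1] below describes under "Level of Parallelism in Data Receiving"; each `createStream` call launches its own receiver on a worker node.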

On Sat, Feb 28, 2015 at 12:56 AM, Corey Nolet <cjnolet@gmail.com> wrote:
> Looking @ [1], it seems to recommend pulling from multiple Kafka topics in
> order to parallelize data received from Kafka over multiple nodes. I notice
> in [2], however, that one of the createStream() functions takes a groupId.
> So am I understanding correctly that creating multiple DStreams with the
> same groupId allows data to be partitioned across many nodes on a single
> topic?
>
> [1]
> http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#level-of-parallelism-in-data-receiving
> [2]
> https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

