spark-user mailing list archives

From <>
Subject RE: Node afinity for Kafka-Direct Stream
Date Wed, 14 Oct 2015 16:00:19 GMT

Another point is that in the receiver-based approach, all the data from Kafka first goes to
the worker where the receiver runs.

Also, if you create one stream (which is the normal case) and you have many worker instances,
only one worker does all the reading. Once that worker has read the data, it can be "repartitioned"
to distribute the load. This repartitioning is a data movement overhead in the receiver-based
approach.

In Receiver approach:
Multiple Kafka input DStreams can be created with different groups and topics for parallel
receiving of data using multiple receivers.

In Direct approach:
Simplified Parallelism: No need to create multiple input Kafka streams and union them.
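
To make the "simplified parallelism" point a bit more concrete: in the direct approach, each
batch is described by one offset range per Kafka topic-partition, and each range becomes one
partition of the resulting RDD. The sketch below is a toy model in plain Python, not Spark code.
The names `OffsetRange` and `plan_batch` are made up for illustration, and starting unseen
partitions at offset 0 is an assumption of this sketch.

```python
from typing import Dict, List, NamedTuple, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition id)


class OffsetRange(NamedTuple):
    """Hypothetical stand-in for the per-partition work unit of one batch."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def plan_batch(consumed: Dict[TopicPartition, int],
               latest: Dict[TopicPartition, int]) -> List[OffsetRange]:
    """Build one OffsetRange per Kafka topic-partition.

    `consumed` holds the next offset to read for each partition,
    `latest` holds the current end offset of each partition.
    Partitions missing from `consumed` start at 0 (assumption of this sketch).
    """
    ranges = []
    for tp in sorted(latest):
        topic, partition = tp
        start = consumed.get(tp, 0)
        ranges.append(OffsetRange(topic, partition, start, latest[tp]))
    return ranges


if __name__ == "__main__":
    consumed = {("events", 0): 100, ("events", 1): 40}
    latest = {("events", 0): 150, ("events", 1): 40, ("events", 2): 7}
    for r in plan_batch(consumed, latest):
        print(r)
```

Three Kafka partitions give three ranges, so three tasks can read in parallel without creating
and unioning several input streams.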

From: Gerard Maas []
Sent: 14 October 2015 18:53
To: Saisai Shao <>
Cc: Rishitesh Mishra <>; spark users <>
Subject: Re: Node afinity for Kafka-Direct Stream

Thanks Saisai, Mishra,

Indeed, that hint will only work in the case where the Spark executor is co-located with the
Kafka broker.
I think the answer to my question as stated is that there's no guarantee of where the task
will execute, as it will depend on the scheduler and the cluster resources available (Mesos in
our case).
Therefore, any assumptions made about data locality using the consumer-based approach need
to be reconsidered when migrating to the direct stream.

(In our case, we were using local caches to decide when a given secondary index for a record
should be produced and written.)

-kr, Gerard.

On Wed, Oct 14, 2015 at 2:58 PM, Saisai Shao <<>>
This preferred locality is a hint to Spark to schedule Kafka tasks on the preferred nodes. If
Kafka and Spark are two separate clusters, this locality hint obviously has no effect, and Spark
will schedule tasks following the node-local -> rack-local -> any pattern, like any other Spark
tasks.
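
A toy illustration of that fallback order, in plain Python rather than Spark code. The function
`locality_level` and the host and rack names are hypothetical, and the sketch ignores the wait
timeouts that the real scheduler applies before it falls back to a less local level.

```python
from typing import Dict, Set


def locality_level(preferred_host: str,
                   executor_hosts: Set[str],
                   rack_of: Dict[str, str]) -> str:
    """Return the best locality a task could get for its preferred host.

    preferred_host: host named by the locality hint (e.g. a Kafka broker).
    executor_hosts: hosts where executors are running.
    rack_of: host -> rack mapping; hosts missing from it have no known rack.
    """
    # Best case: an executor runs on the preferred host itself.
    if preferred_host in executor_hosts:
        return "NODE_LOCAL"
    # Next best: an executor shares a rack with the preferred host.
    rack = rack_of.get(preferred_host)
    if rack is not None and any(rack_of.get(h) == rack for h in executor_hosts):
        return "RACK_LOCAL"
    # Otherwise the hint cannot be honoured and any executor will do.
    return "ANY"


if __name__ == "__main__":
    racks = {"kafka1": "r1", "spark1": "r1", "spark2": "r2"}
    # Co-located: the broker host also runs an executor.
    print(locality_level("kafka1", {"kafka1", "spark2"}, racks))
    # Separate clusters, no shared rack: the hint has no effect.
    print(locality_level("kafka1", {"spark2"}, racks))
```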

On Wed, Oct 14, 2015 at 8:10 PM, Rishitesh Mishra <<>>
Hi Gerard,
I am also trying to understand the same issue. From the code I have seen, it looks like once the
Kafka RDD is constructed, the execution of that RDD is up to the task scheduler, which can schedule
the partitions based on the load on the nodes. There is a preferred node specified in the Kafka RDD,
but AFAIK it maps to the Kafka partition's host. So if Kafka and Spark are co-hosted, this will
probably work. If not, I am not sure how to get data locality for a partition.
Correct me if there is a way.

On Wed, Oct 14, 2015 at 3:08 PM, Gerard Maas <<>>
In the receiver-based Kafka streaming model, given that each receiver starts as a long-running
task, one can rely on a certain degree of data locality based on the Kafka partitioning:
data published on a given topic/partition will land on the same Spark Streaming receiving
node until the receiver dies and needs to be restarted somewhere else.

As I understand it, the direct Kafka streaming model just computes offsets and relays the work
to a KafkaRDD. How does the execution locality compare with the receiver-based approach?

thanks, Gerard.


Rishitesh Mishra,
SnappyData
