spark-dev mailing list archives

From Suprith T Jain <>
Subject Spark Kafka API tries connecting to dead node for every batch, which increases the processing time
Date Mon, 16 Oct 2017 13:08:55 GMT
Hi guys,

I have a 3-node cluster and I am running a Spark Streaming job. Consider the
example below:

spark-submit --master yarn-cluster --class
com.huawei.bigdata.spark.examples.FemaleInfoCollectionPrint --jars
/opt/SparkStreamingExample-1.0.jar /tmp/test 10 test,,

In this case, suppose one node is down. Then for every window batch,
Spark tries to send a request to all the nodes.
This code is in the class org.apache.spark.streaming.kafka.KafkaCluster

Function : getPartitionMetadata()
Line : val resp: TopicMetadataResponse = consumer.send(req)

The function getPartitionMetadata() is called from getPartitions() and
findLeaders(), both of which are called for every batch.
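To illustrate the pattern (this is a simplified sketch, not the actual Spark
source; Broker and firstSuccess are made-up names): the metadata lookup tries
each broker in turn until one answers, so a dead broker sits in the path on
every batch.

```scala
// Illustrative sketch of the broker-iteration pattern in KafkaCluster.
// NOT the real Spark code -- names and details are simplified.
case class Broker(host: String, port: Int)

def firstSuccess[T](brokers: Seq[Broker])(request: Broker => Option[T]): Option[T] =
  // The view keeps this lazy: brokers after the first successful one are
  // never contacted. But every dead broker tried BEFORE a live one blocks
  // for the full socket timeout, and this runs again on each batch.
  brokers.view.map(request).collectFirst { case Some(r) => r }
```

For example, with brokers ordered (dead, live), each call pays one full
timeout on the dead broker before the live one answers.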

Hence, if the node is down, the connection fails and Spark waits for the
timeout before continuing, which adds to the processing time.
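The length of that wait is governed by the consumer socket timeout, which can
be passed through kafkaParams. A sketch (assuming the 0.8 direct stream API;
the host names, topic, and ssc are placeholders, and the exact effect depends
on your spark-streaming-kafka version):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch: lower the socket timeout so a dead broker costs less per batch.
// "socket.timeout.ms" is the Kafka 0.8 SimpleConsumer setting that
// KafkaCluster honors; the default is 30000 ms.
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "node1:9092,node2:9092,node3:9092", // example hosts
  "socket.timeout.ms"    -> "3000"                              // fail fast
)

// ssc is assumed to be an existing StreamingContext.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("test"))
```

This only shortens the per-batch penalty; it does not stop Spark from
contacting the dead node.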

Question:
Is there any way to avoid this?
In simple words, I do not want Spark to send a request to the node that is
down on every batch. How can I achieve this?
