spark-user mailing list archives

Subject Re: Spark Streaming over YARN
Date Fri, 02 Oct 2015 15:52:12 GMT
From my understanding, as soon as I use YARN I don't need to handle parallelism myself (at
least for RDD processing).
I don't want to use the direct stream because I would have to manage offset positioning myself
(in order to be able to restart from the last processed offset after a Spark job failure).

----- Original Message -----
From: "Cody Koeninger" <>
To: "Nicolas Biau" <>
Cc: "user" <>
Sent: Friday, 2 October 2015 17:43:41
Subject: Re: Spark Streaming over YARN

If you're using the receiver based implementation, and want more parallelism, you have to
create multiple streams and union them together. 

Or use the direct stream. 
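
A minimal sketch of the first option (several receiver-based streams unioned into one), written in PySpark for brevity; the same `KafkaUtils.createStream` and `StreamingContext.union` calls exist in the Scala and Java APIs. The ZooKeeper address, group id, topic name, and the helpers `union_receiver_streams`, `save_partition` and `main` are placeholders invented for illustration, not taken from the original job. It assumes the receiver-based Kafka integration (`pyspark.streaming.kafka`), which ships with Spark 1.x/2.x and was removed in Spark 3.0.

```python
def union_receiver_streams(ssc, make_stream, num_receivers):
    """Create `num_receivers` receiver-based streams and union them into one DStream.

    `make_stream` is a callable taking the StreamingContext and returning a DStream.
    """
    if num_receivers < 1:
        raise ValueError("num_receivers must be at least 1")
    streams = [make_stream(ssc) for _ in range(num_receivers)]
    # In PySpark, StreamingContext.union takes the DStreams as positional arguments.
    return ssc.union(*streams)


def save_partition(records):
    """Hypothetical sink: replace the body with the real MongoDB write."""
    for _key, _value in records:
        pass


def main():
    # Imported here so the helper above stays importable without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # receiver-based API, Spark 1.x/2.x

    sc = SparkContext(appName="kafka-to-mongo")  # placeholder application name
    ssc = StreamingContext(sc, 5)                # 5-second batches (assumption)

    def make_stream(ctx):
        # Placeholder ZooKeeper quorum, consumer group and topic.
        return KafkaUtils.createStream(ctx, "zk-host:2181", "my-group", {"my-topic": 1})

    # One receiver per Kafka partition, so receiving is spread over several executors.
    events = union_receiver_streams(ssc, make_stream, 4)

    # Repartition so that processing also uses all available cores.
    events.repartition(8).foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))

    ssc.start()
    ssc.awaitTermination()

# Call main() from the driver script submitted with spark-submit.
```

Each receiver permanently occupies one core, so the application needs more cores in total than it has receivers, otherwise nothing is left to process the batches. On YARN that means requesting enough executors and cores explicitly when submitting the job.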

On Fri, Oct 2, 2015 at 10:40 AM, < > wrote: 

I have a job receiving data from Kafka (4 partitions) and persisting the data in MongoDB.

It works fine, but when I deploy it on a YARN cluster (4 nodes with 2 cores each), only one node
is receiving all the Kafka partitions and only one node is doing my RDD processing (foreach
How can I force YARN to use all the resources (nodes and cores) to process the data (receiver
& RDD processing)?

Thanks a lot

To unsubscribe, e-mail: 
For additional commands, e-mail: 
