spark-user mailing list archives

From: Cody Koeninger <c...@koeninger.org>
Subject: Re: Spark Streaming over YARN
Date: Fri, 02 Oct 2015 16:43:00 GMT
The direct stream has nothing to do with Zookeeper.

The direct stream can start at the offsets you specify.  If you're not
storing offsets in checkpoints, how and where you store them is up to you.

Have you read / watched the information linked from

https://github.com/koeninger/kafka-exactly-once ?
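
For illustration, a minimal sketch against the Spark 1.x Kafka API (the
topic name, partition count, starting offsets, and broker list are
placeholders, and ssc is assumed to be an existing StreamingContext):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// Load the last committed offsets from wherever you keep them
// (hypothetical values: topic "mytopic", 4 partitions, starting at 0).
val fromOffsets: Map[TopicAndPartition, Long] =
  (0 until 4).map(p => TopicAndPartition("mytopic", p) -> 0L).toMap

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

stream.foreachRDD { rdd =>
  // The direct stream exposes the exact Kafka offsets behind each partition.
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the rdd (e.g. write to MongoDB) ...
  // Then persist `offsets` to your own store, ideally atomically with the
  // results, so a failed job can restart from the last processed offset.
}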


On Fri, Oct 2, 2015 at 11:36 AM, <nibiau@free.fr> wrote:

> Sorry, I just said that I NEED to manage offsets, so in the case of the
> Kafka direct stream, how can I handle this?
> Update Zookeeper manually? Why not, but are there any other solutions?
>
> ----- Original Message -----
> From: "Cody Koeninger" <cody@koeninger.org>
> To: "Nicolas Biau" <nibiau@free.fr>
> Cc: "user" <user@spark.apache.org>
> Sent: Friday, October 2, 2015, 18:29:09
> Subject: Re: Spark Streaming over YARN
>
>
> Neither of those statements is true.
> You need more receivers if you want more parallelism.
> You don't have to manage offset positioning with the direct stream if you
> don't want to, as long as you can accept the limitations of Spark
> checkpointing.
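>
> For illustration, checkpoint-based recovery looks roughly like this (the
> checkpoint directory is a placeholder, and building the Kafka stream and
> processing graph is elided):
>
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
>
> val checkpointDir = "hdfs:///checkpoints/myapp" // placeholder path
>
> def createContext(): StreamingContext = {
>   val ssc = new StreamingContext(new SparkConf(), Seconds(5))
>   ssc.checkpoint(checkpointDir)
>   // Build the Kafka direct stream and processing graph here; the
>   // consumed offsets are then tracked in the checkpoint for you.
>   ssc
> }
>
> // On restart this recovers the context, including Kafka offsets, from
> // the checkpoint instead of you managing them by hand.
> val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
> ssc.start()
> ssc.awaitTermination()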
>
>
> On Fri, Oct 2, 2015 at 10:52 AM, <nibiau@free.fr> wrote:
>
>
> From my understanding, as soon as I use YARN I don't need to manage
> parallelism myself (at least for the RDD processing).
> I don't want to use the direct stream because I would have to manage the
> offset positioning (in order to be able to restart from the last offset
> processed after a Spark job failure).
>
>
> ----- Original Message -----
> From: "Cody Koeninger" <cody@koeninger.org>
> To: "Nicolas Biau" <nibiau@free.fr>
> Cc: "user" <user@spark.apache.org>
> Sent: Friday, October 2, 2015, 17:43:41
> Subject: Re: Spark Streaming over YARN
>
>
>
>
> If you're using the receiver-based implementation and want more
> parallelism, you have to create multiple streams and union them together.
>
>
> Or use the direct stream.
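>
> For illustration, a minimal sketch of the union approach against the
> receiver-based Kafka API (the Zookeeper quorum, group id, and topic map
> are placeholders, and ssc is an existing StreamingContext):
>
> import org.apache.spark.streaming.kafka.KafkaUtils
>
> // One receiver per stream; four streams to match the four Kafka
> // partitions, each consuming with one thread.
> val streams = (1 to 4).map { _ =>
>   KafkaUtils.createStream(ssc, "zkhost:2181", "mygroup", Map("mytopic" -> 1))
> }
> val unioned = ssc.union(streams)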
>
>
> On Fri, Oct 2, 2015 at 10:40 AM, <nibiau@free.fr> wrote:
>
>
> Hello,
> I have a job receiving data from Kafka (4 partitions) and persisting the
> data into MongoDB.
> It works fine, but when I deploy it on a YARN cluster (4 nodes with 2
> cores each), only one node receives all the Kafka partitions and only one
> node processes my RDD treatment (the foreach function).
> How can I force YARN to use all the nodes and cores to process the data
> (both the receiver and the RDD processing)?
>
> Thanks a lot,
> Nicolas
>
