spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "OUASSAIDI, Sami" <sami.ouassa...@mind7.fr>
Subject Re: [Spark Streaming+Kafka][How-to]
Date Wed, 22 Mar 2017 00:35:35 GMT
So it worked quite well with a coalesce, I was able to find an solution to
my problem : Altough not directly handling the executor a good roundaway
was to assign the desired partition to a specific stream through assign
strategy and coalesce to a single partition then repeat the same process
for the remaining topics on different streams and at the end do a an union
of these streams.

PS : No shuffle was made during the whole thing since the rdd partitions
were collapsed to a single one

Le 17 mars 2017 8:04 PM, "Michael Armbrust" <michael@databricks.com> a
écrit :

> Another option that would avoid a shuffle would be to use assign and
> coalesce, running two separate streams.
>
> spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "...")
>   .option("assign", """{t0: {"0": xxxx}, t1:{"0": xxxxx}}""")
>   .load()
>   .coalesce(1)
>   .writeStream
>   .foreach(... code to write to cassandra ...)
>
> spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "...")
>   .option("assign", """{t0: {"1": xxxx}, t1:{"1": xxxxx}}""")
>   .load()
>   .coalesce(1)
>   .writeStream
>   .foreach(... code to write to cassandra ...)
>
> On Fri, Mar 17, 2017 at 7:35 AM, OUASSAIDI, Sami <sami.ouassaidi@mind7.fr>
> wrote:
>
>> @Cody : Duly noted.
>> @Michael Ambrust : A repartition is out of the question for our project
>> as it would be a fairly expensive operation. We tried looking into
>> targeting a specific executor so as to avoid this extra cost and directly
>> have well partitioned data after consuming the kafka topics. Also we are
>> using Spark streaming to save to the cassandra DB and try to keep shuffle
>> operations to a strict minimum (at best none). As of now we are not
>> entirely pleased with our current performances, that's why I'm doing a
>> kafka topic sharding POC and getting the executor to handle the specificied
>> partitions is central.
>> ᐧ
>>
>> 2017-03-17 9:14 GMT+01:00 Michael Armbrust <michael@databricks.com>:
>>
>>> Sorry, typo.  Should be a repartition not a groupBy.
>>>
>>>
>>>> spark.readStream
>>>>   .format("kafka")
>>>>   .option("kafka.bootstrap.servers", "...")
>>>>   .option("subscribe", "t0,t1")
>>>>   .load()
>>>>   .repartition($"partition")
>>>>   .writeStream
>>>>   .foreach(... code to write to cassandra ...)
>>>>
>>>
>>
>>
>> --
>> *Mind7 Consulting*
>>
>> Sami Ouassaid | Consultant Big Data | sami.ouassaidi@mind7.com
>> __
>>
>> 64 Rue Taitbout, 75009 Paris
>>
>
>

Mime
View raw message