spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Fregly <ch...@fregly.com>
Subject Re: Low Level Kafka Consumer for Spark
Date Thu, 28 Aug 2014 18:02:25 GMT
@bharat-

overall, i've noticed a lot of confusion about how Spark Streaming scales -
as well as how it handles failover and checkpointing, but we can discuss
that separately.

there's actually 2 dimensions to scaling here:  receiving and processing.

*Receiving*
receiving can be scaled out by submitting new DStreams/Receivers to the
cluster as i've done in the Kinesis example.  in fact, i purposely chose to
submit multiple receivers in my Kinesis example because i feel it should be
the norm and not the exception - particularly for partitioned and
checkpoint-capable streaming systems like Kafka and Kinesis.   it's the
only way to scale.

a side note here is that each receiver running in the cluster will
immediately replicates to 1 other node for fault-tolerance of that specific
receiver.  this is where the confusion lies.  this 2-node replication is
mainly for failover in case the receiver dies while data is in flight.
 there's still chance for data loss as there's no write ahead log on the
hot path, but this is being addressed.

this in mentioned in the docs here:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

*Processing*
once data is received, tasks are scheduled across the Spark cluster just
like any other non-streaming task where you can specify the number of
partitions for reduces, etc.  this is the part of scaling that is sometimes
overlooked - probably because it "works just like regular Spark", but it is
worth highlighting.

Here's a blurb in the docs:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-processing

the other thing that's confusing with Spark Streaming is that in Scala, you
need to explicitly

import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions

in order to pick up the implicits that allow DStream.reduceByKey and such
(versus DStream.transform(rddBatch => rddBatch.reduceByKey())

in other words, DStreams appear to be relatively featureless until you
discover this implicit.  otherwise, you need to operate on the underlying
RDD's explicitly which is not ideal.

the Kinesis example referenced earlier in the thread uses the DStream
implicits.


side note to all of this - i've recently convinced my publisher for my
upcoming book, Spark In Action, to let me jump ahead and write the Spark
Streaming chapter ahead of other more well-understood libraries.  early
release is in a month or so.  sign up  @ http://sparkinaction.com if you
wanna get notified.

shameless plug that i wouldn't otherwise do, but i really think it will
help clear a lot of confusion in this area as i hear these questions asked
a lot in my talks and such.  and i think a clear, crisp story on scaling
and fault-tolerance will help Spark Streaming's adoption.

hope that helps!

-chris




On Wed, Aug 27, 2014 at 6:32 PM, Dibyendu Bhattacharya <
dibyendu.bhattachary@gmail.com> wrote:

> I agree. This issue should be fixed in Spark rather rely on replay of
> Kafka messages.
>
> Dib
> On Aug 28, 2014 6:45 AM, "RodrigoB" <rodrigo.boavida@aspect.com> wrote:
>
>> Dibyendu,
>>
>> Tnks for getting back.
>>
>> I believe you are absolutely right. We were under the assumption that the
>> raw data was being computed again and that's not happening after further
>> tests. This applies to Kafka as well.
>>
>> The issue is of major priority fortunately.
>>
>> Regarding your suggestion, I would maybe prefer to have the problem
>> resolved
>> within Spark's internals since once the data is replicated we should be
>> able
>> to access it once more and not having to pool it back again from Kafka or
>> any other stream that is being affected by this issue. If for example
>> there
>> is a big amount of batches to be recomputed I would rather have them done
>> distributed than overloading the batch interval with huge amount of Kafka
>> messages.
>>
>> I do not have yet enough know how on where is the issue and about the
>> internal Spark code so I can't really how much difficult will be the
>> implementation.
>>
>> tnks,
>> Rod
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p12966.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>

Mime
View raw message