spark-user mailing list archives

From "Jahagirdar, Madhu" <madhu.jahagir...@philips.com>
Subject RE: Dstream Transformations
Date Mon, 06 Oct 2014 09:31:37 GMT
Doesn't Spark keep track of the DAG lineage and restart from where it stopped? Does it always have
to start from the beginning of the lineage when the job fails?

________________________________
From: Massimiliano Tomassi [max.tomassi@gmail.com]
Sent: Monday, October 06, 2014 2:40 PM
To: Jahagirdar, Madhu
Cc: Akhil Das; user
Subject: Re: Dstream Transformations

From the Spark Streaming Programming Guide (http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node):

...output operations (like foreachRDD) have at-least once semantics, that is, the transformed
data may get written to an external entity more than once in the event of a worker failure.

I think that when a worker fails, the entire graph of transformations/actions will be reapplied
to that RDD. This means that, in your case, both store operations will be executed again.
For this reason, in a video I watched on YouTube, they suggest making all output operations
idempotent. Unfortunately that isn't always possible: e.g. when you are building an analytics
system and need to increment counters.
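
To make that concrete, here is a minimal sketch of an idempotent output operation. It assumes the
elasticsearch-hadoop connector is on the classpath and uses a hypothetical "events/event" index and
a plain socket source as placeholders; the only point is that deriving the document id from the
event itself ("es.mapping.id") turns a replayed batch into an upsert rather than a duplicate insert.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.elasticsearch.spark._  // elasticsearch-hadoop connector (assumed dependency)

object IdempotentOutputSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("idempotent-output-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Placeholder source; in the setup discussed in this thread it would be a KafkaUtils stream.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      // Derive the document id from the event itself so that a replayed batch
      // overwrites the same documents instead of appending new ones.
      val docs = rdd.map(line => Map("docId" -> line.hashCode.toString, "body" -> line))
      docs.saveToEs("events/event", Map("es.mapping.id" -> "docId"))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

With a deterministic id, re-running the foreachRDD block after a worker failure rewrites the same
documents, which is what makes the at-least-once replay harmless.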

This is what I've got so far; does anyone have a different point of view?

On 6 October 2014 08:59, Jahagirdar, Madhu <madhu.jahagirdar@philips.com>
wrote:
Given that I have multiple worker nodes, when Spark schedules the job again on the worker
nodes that are alive, does it store the data in Elasticsearch again and then in Flume, or
does it only run the function that stores in Flume?

Regards,
Madhu Jahagirdar

________________________________
From: Akhil Das [akhil@sigmoidanalytics.com]
Sent: Monday, October 06, 2014 1:20 PM
To: Jahagirdar, Madhu
Cc: user
Subject: Re: Dstream Transformations

AFAIK Spark doesn't restart worker nodes itself. You can have multiple worker nodes, and in
that case, if one worker node goes down, Spark will try to recompute the lost RDDs on the
workers that are still alive.

Thanks
Best Regards

On Sun, Oct 5, 2014 at 5:19 AM, Jahagirdar, Madhu <madhu.jahagirdar@philips.com>
wrote:
In my Spark Streaming program I have created a KafkaUtils stream to receive data and store it in
Elasticsearch and in Flume. Both store functions are applied to the same DStream. My question is:
what is the behavior of Spark if, after storing the data in Elasticsearch, the worker node dies
before storing it in Flume? Does it restart the worker and then store the data again in
Elasticsearch and then in Flume, or does it only run the function that stores in Flume?

Regards
Madhu Jahagirdar
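
For clarity, here is a rough sketch of the kind of job described in this question. The
storeInElasticsearch and storeInFlume helpers are hypothetical stand-ins for the real client code;
the sketch only illustrates that the two writes are separate output operations on the same DStream,
each of which is re-executed with at-least-once semantics after a failure.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object TwoOutputsSketch {
  // Hypothetical sinks: stand-ins for whatever Elasticsearch and Flume client code the real job uses.
  def storeInElasticsearch(event: String): Unit = println(s"ES <- $event")
  def storeInFlume(event: String): Unit = println(s"Flume <- $event")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("two-outputs-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Receiver-based Kafka stream (Spark 1.x API); ZooKeeper host, group and topic are placeholders.
    val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "consumer-group", Map("events" -> 1))
    val values = stream.map(_._2)

    // Output operation 1: write each batch to Elasticsearch.
    values.foreachRDD(rdd => rdd.foreach(storeInElasticsearch))

    // Output operation 2: write each batch to Flume.
    values.foreachRDD(rdd => rdd.foreach(storeInFlume))

    ssc.start()
    ssc.awaitTermination()
  }
}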







--
------------------------------------------------
Massimiliano Tomassi
------------------------------------------------
web: http://about.me/maxtomassi
e-mail: max.tomassi@gmail.com
------------------------------------------------
