spark-reviews mailing list archives

From JoshRosen <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-6128][Streaming][Documentation] Updates...
Date Mon, 09 Mar 2015 23:44:30 GMT
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4956#discussion_r26087878
  
    --- Diff: docs/streaming-programming-guide.md ---
    @@ -1868,13 +1961,38 @@ Furthermore, there are two kinds of failures that we should be
concerned about:
     
     With this basic knowledge, let us understand the fault-tolerance semantics of Spark Streaming.
     
    -## Semantics with files as input source
    +## Definitions
    +{:.no_toc}
    +The semantics of streaming systems are often captured in terms of how many times each
record can be processed by the system. There are three types of guarantees that a system can
provide under all possible operating conditions (despite failures, etc.):
    +
    +1. *At most once*: Each record will be either processed once or not processed at all.
    +2. *At least once*: Each record will be processed one or more times. This is stronger
than *at most once* as it ensures that no data will be lost, but there may be duplicates.
    +3. *Exactly once*: Each record will be processed exactly once - no data will be lost
and no data will be processed multiple times. This is obviously the strongest guarantee of
the three.
    +
    +## Basic Semantics
    +{:.no_toc}
    +In any stream processing system, broadly speaking, there are three steps in processing
the data.
    +1. *Receiving the data*: The data is received from sources using Receivers or otherwise.
    +1. *Transforming the data*: The received data is transformed using DStream and RDD
transformations.
    +1. *Pushing out the data*: The final transformed data is pushed out to external systems
like file systems, databases, dashboards, etc.
    +
    +If a streaming application has to achieve end-to-end exactly-once guarantees, then each
step has to provide an exactly-once guarantee. That is, each record must be received exactly
once, transformed exactly once, and pushed to downstream systems exactly once. In case of
Spark Streaming, lets understand the scope of Spark Streaming.
    --- End diff --
    
    lets -> let's.
    
    Also: "In case of Spark Streaming, let's understand the scope of Spark Streaming" sounds
a little ["By installing Java, you will be able to experience the power of Java"](http://www.joelonsoftware.com/items/2009/01/12.html)
to me.  I guess that this sentence is trying to say that we need to clearly define the boundary
of Spark Streaming vs. external systems in order to meaningfully talk about guarantees (e.g.
it can't guarantee transactional behavior of downstream systems, etc.), i.e. let's be clear
about the scope of where these guarantees hold.
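
The three guarantees defined in the diff above can be illustrated with a small,
Spark-agnostic sketch (plain Python with hypothetical names; this is not the Spark
Streaming API). A flaky sink shows how at-most-once drops a record, at-least-once
duplicates it on retry, and deduplication by record id on top of at-least-once
delivery yields an exactly-once *effect*:

```python
# Toy illustration of at-most-once / at-least-once / exactly-once delivery.
# All names here are hypothetical; the "failure" is a simulated lost write/ack.

def deliver_at_most_once(records, sink, fail_on=None):
    """Fire-and-forget: a failed write is not retried, so the record is lost."""
    for i, rec in enumerate(records):
        try:
            if i == fail_on:
                raise IOError("simulated transient failure")
            sink.append(rec)
        except IOError:
            pass  # record dropped: at most one copy ever reaches the sink

def deliver_at_least_once(records, sink, fail_on=None):
    """Retry until acknowledged. If the write succeeded but the ack was lost,
    the retry writes the record a second time -> possible duplicates."""
    for i, rec in enumerate(records):
        sink.append(rec)          # write succeeds...
        if i == fail_on:
            sink.append(rec)      # ...but the ack is lost, so we re-send

def deliver_exactly_once(records, sink, seen_ids, fail_on=None):
    """At-least-once delivery plus an idempotent (deduplicating) sink,
    keyed by a unique record id, gives an exactly-once effect."""
    staged = []
    deliver_at_least_once(records, staged, fail_on)
    for rec_id, payload in staged:
        if rec_id not in seen_ids:
            seen_ids.add(rec_id)
            sink.append((rec_id, payload))

records = [(1, "a"), (2, "b"), (3, "c")]

amo = []
deliver_at_most_once(records, amo, fail_on=1)   # record 2 is lost
alo = []
deliver_at_least_once(records, alo, fail_on=1)  # record 2 is duplicated
exo = []
deliver_exactly_once(records, exo, set(), fail_on=1)  # each record once
```

The point of the third function is the one the diff is driving at: exactly-once
is usually achieved by combining at-least-once delivery with idempotent or
transactional output, which is exactly why the boundary between the streaming
system and the downstream sink matters for the guarantee.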


