spark-dev mailing list archives

From Amit Sela <amitsel...@gmail.com>
Subject Re: StructuredStreaming status
Date Wed, 19 Oct 2016 20:49:11 GMT
I've been working on the Apache Beam Spark runner, which (in this
context) basically runs a streaming model focused on event-time and
correctness on top of Spark, and as I see it (even in Spark 1.6.x) the
micro-batches are really just added latency, which will work out for some
users and not for others, and that's OK. Structured Streaming triggers make
it even better by computing on trigger (other systems compute continuously
but output only on trigger, so there is not much difference there).

I'm actually curious about a couple of things:

   - State store API - having a state API available is extremely useful for
   streaming on many fronts:
      - Available for sources (and sinks?) to avoid checkpointing
      micro-batch reads (tasks can simply pick up where they left off in the
      previous micro-batch).
      - Can help get rid of resuming from checkpoint: you can simply
      restart. That goes for upgrading Spark jobs as well as for wrapping
      accumulators and broadcasts in getOrCreate methods (and if you're not
      resuming from checkpoint, you can avoid wrapping your DAG construction
      in getOrCreate as well).
      - The fact that it aims to be pluggable enables building platforms
      (not just frameworks) with Spark.
      - Finally, it is basically the basis for any stateful computation
      Spark will support.
   - Evicting state, as Michael pointed out; currently, with for example
   overlapping windows, the Dataset grows really quickly.
   - Encoders API - where does it stand? Will developers/users be able to
   define a custom schema for, say, a generically typed class? Will it get
   along with inner classes, static classes, etc.?
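
To make the eviction concern above concrete, here is a toy model (plain
Python, not Spark's state store API; all names are hypothetical) of why
overlapping windows grow state quickly and how an event-time watermark
would let closed windows be finalized and dropped:

```python
# Toy model of state growth with overlapping (sliding) windows and
# watermark-based eviction. Plain Python, NOT the Spark state store API;
# all names here are hypothetical.

def sliding_windows(t, size, slide):
    """All [start, end) windows of the given size/slide that contain time t."""
    start = (t // slide) * slide      # latest window start <= t
    out = []
    while start > t - size:           # walk back through overlapping windows
        out.append((start, start + size))
        start -= slide
    return out

class WindowedCounter:
    def __init__(self, size, slide):
        self.size, self.slide = size, slide
        self.state = {}               # window -> event count

    def update(self, t):
        # Every event touches size/slide windows, so without eviction the
        # state grows quickly, which is the problem described above.
        for w in sliding_windows(t, self.size, self.slide):
            self.state[w] = self.state.get(w, 0) + 1

    def evict(self, watermark):
        # A watermark asserts "no events older than this will arrive", so
        # any window that ended at or before it can be emitted and dropped.
        closed = {w: c for w, c in self.state.items() if w[1] <= watermark}
        for w in closed:
            del self.state[w]
        return closed
```

For example, with size=10 and slide=5 each event lands in two windows;
after events at times 1, 7 and 12, evicting at watermark 10 finalizes the
two windows ending at or before 10 and keeps only the open ones.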
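
Sessionization, mentioned in Michael's list below, can likewise be
sketched in a few lines (again plain Python as a hypothetical
illustration of the semantics, not the Structured Streaming API):

```python
# Toy sessionization: group events per key into sessions separated by a
# gap timeout. Illustrative only; not a Spark API.
from collections import defaultdict

def sessionize(events, gap):
    """events: iterable of (key, event_time). Returns {key: [(start, end), ...]}."""
    by_key = defaultdict(list)
    for key, t in events:
        by_key[key].append(t)
    sessions = {}
    for key, times in by_key.items():
        times.sort()
        out = [[times[0], times[0]]]
        for t in times[1:]:
            if t - out[-1][1] <= gap:
                out[-1][1] = t          # within the gap: extend the session
            else:
                out.append([t, t])      # gap exceeded: start a new session
        sessions[key] = [tuple(s) for s in out]
    return sessions
```

So with a gap of 5, events for one key at times 1, 3 and 20 form two
sessions: (1, 3) and (20, 20).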

Thanks,
Amit

On Wed, Oct 19, 2016 at 11:30 PM Michael Armbrust <michael@databricks.com>
wrote:

> I know people are seriously thinking about latency.  So far that has not
> been the limiting factor in the users I've been working with.
>
> On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger <cody@koeninger.org>
> wrote:
>
> Is anyone seriously thinking about alternatives to microbatches?
>
> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
> <michael@databricks.com> wrote:
> > Anything that is actively being designed should be in JIRA, and it seems
> > like you found most of it.  In general, release windows can be found on
> the
> > wiki.
> >
> > 2.1 has a lot of stability fixes as well as the Kafka support you
> mentioned.
> > It may also include some of the following.
> >
> > The items I'd like to start thinking about next are:
> >  - Evicting state from the store based on event time watermarks
> >  - Sessionization (grouping together related events by key / eventTime)
> >  - Improvements to the query planner (remove some of the restrictions on
> > what queries can be run).
> >
> > This is roughly in order based on what I've been hearing users hit the
> most.
> > Would love more feedback on what is blocking real use cases.
> >
> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor <ofir.manor@equalum.io>
> wrote:
> >>
> >> Hi,
> >> I hope it is the right forum.
> >> I am looking for some information of what to expect from
> >> StructuredStreaming in its next releases to help me choose when / where
> to
> >> start using it more seriously (or where to invest in workarounds and
> where
> >> to wait). I couldn't find a good place where such planning discussed
> for 2.1
> >> (like, for example ML and SPARK-15581).
> >> I'm aware of the 2.0 documented limits
> >> (
> http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations
> ),
> >> like no support for multiple aggregation levels, joins strictly against
> a
> >> static dataset (no SCD or stream-stream joins), limited sources / sinks
> (like
> >> no sink for interactive queries), etc.
> >> I'm also aware of some changes that have landed in master, like the new
> >> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
> >> metrics in SPARK-17731, and some improvements for the file source.
> >> If I remember correctly, the discussion on Spark release cadence
> >> concluded with a preference for four-month cycles, with a likely code
> >> freeze pretty soon (end of October). So I believe the scope for 2.1
> >> should be quite clear to some by now, and that 2.2 planning should be
> >> starting about now.
> >> Any visibility / sharing will be highly appreciated!
> >> thanks in advance,
> >>
> >> Ofir Manor
> >>
> >> Co-Founder & CTO | Equalum
> >>
> >> Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io
> >
> >
>
>
>
