spark-dev mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: StructuredStreaming status
Date Thu, 20 Oct 2016 04:40:40 GMT
Yeah, as Shivaram pointed out, there have been research projects that looked at it. Also, Structured
Streaming was explicitly designed to not make microbatching part of the API or part of the
output behavior (tying triggers to it). However, when people begin working on that is a function
of demand relative to other features. I don't think we can commit to one plan before exploring
more options, but basically there is Shivaram's project, which adds a few new concepts to
the scheduler, and there's the option to reduce control plane latency in the current system,
which hasn't been heavily optimized yet but should be doable (lots of systems can handle tens of
thousands of RPCs per second).

Matei

> On Oct 19, 2016, at 9:20 PM, Cody Koeninger <cody@koeninger.org> wrote:
> 
> I don't think it's just about what to target - if you could target 1ms batches, without
> harming 1 second or 1 minute batches... why wouldn't you?
> I think it's about having a clear strategy and dedicating resources to it. If scheduling
> batches at an order of magnitude or two lower latency is the strategy, and that's actually
> feasible, that's great. But I haven't seen that clear direction, and this is by no means a
> recent issue.
> 
> 
> On Oct 19, 2016 7:36 PM, "Matei Zaharia" <matei.zaharia@gmail.com> wrote:
> I'm also curious whether there are concerns other than latency with the way stuff executes
> in Structured Streaming (now that the time steps don't have to act as triggers), as well as
> what latency people want for various apps.
> 
> The stateful operator designs for streaming systems aren't inherently "better" than micro-batching
> -- they lose a lot of stuff that is possible in Spark, such as load balancing work dynamically
> across nodes, speculative execution for stragglers, scaling clusters up and down elastically,
> etc. Moreover, Spark itself could execute the current model with much lower latency. The question
> is just what combinations of latency, throughput, fault recovery, etc. to target.
> 
> Matei
> 
>> On Oct 19, 2016, at 2:18 PM, Amit Sela <amitsela33@gmail.com> wrote:
>> 
>> 
>> 
>> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman <shivaram@eecs.berkeley.edu> wrote:
>> At the AMPLab we've been working on a research project that looks at
>> just the scheduling latencies and on techniques to get lower
>> scheduling latency. It moves away from the micro-batch model, but
>> reuses the fault tolerance etc. in Spark. However we haven't yet
>> figured out all the parts in integrating this with the rest of
>> structured streaming. I'll try to post a design doc / SIP about this
>> soon.
>> 
>> On a related note - are there other problems users face with
>> micro-batch other than latency?
>> I think that the fact that they serve as an output trigger is a problem, but Structured
>> Streaming seems to resolve this now.
>> 
>> Thanks
>> Shivaram
>> 
>> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust
>> <michael@databricks.com> wrote:
>> > I know people are seriously thinking about latency.  So far that has not
>> > been the limiting factor in the users I've been working with.
>> >
>> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger <cody@koeninger.org> wrote:
>> >>
>> >> Is anyone seriously thinking about alternatives to microbatches?
>> >>
>> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust
>> >> <michael@databricks.com> wrote:
>> >> > Anything that is actively being designed should be in JIRA, and it seems
>> >> > like you found most of it.  In general, release windows can be found on
>> >> > the wiki.
>> >> >
>> >> > 2.1 has a lot of stability fixes as well as the Kafka support you
>> >> > mentioned. It may also include some of the following.
>> >> >
>> >> > The items I'd like to start thinking about next are:
>> >> >  - Evicting state from the store based on event time watermarks
>> >> >  - Sessionization (grouping together related events by key / eventTime)
>> >> >  - Improvements to the query planner (remove some of the restrictions on
>> >> >    what queries can be run).
>> >> >
>> >> > This is roughly in order based on what I've been hearing users hit the
>> >> > most.
>> >> > Would love more feedback on what is blocking real use cases.
>> >> >
>> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor <ofir.manor@equalum.io> wrote:
>> >> >>
>> >> >> Hi,
>> >> >> I hope it is the right forum.
>> >> >> I am looking for some information on what to expect from
>> >> >> StructuredStreaming in its next releases to help me choose when / where
>> >> >> to start using it more seriously (or where to invest in workarounds and
>> >> >> where to wait). I couldn't find a good place where such planning is
>> >> >> discussed for 2.1 (like, for example, ML and SPARK-15581).
>> >> >> I'm aware of the 2.0 documented limits
>> >> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
>> >> >> like no support for multiple aggregation levels, joins are strictly to
>> >> >> a static dataset (no SCD or stream-stream), limited sources / sinks
>> >> >> (like no sink for interactive queries), etc.
>> >> >> I'm also aware of some changes that have landed in master, like the new
>> >> >> Kafka 0.10 source (and its on-going improvements) in SPARK-15406, the
>> >> >> metrics in SPARK-17731, and some improvements for the file source.
>> >> >> If I remember correctly, the discussion on Spark release cadence
>> >> >> concluded with a preference for a four-month cycle, with a likely code
>> >> >> freeze pretty soon (end of October). So I believe the scope for 2.1 should
>> >> >> likely be quite clear to some, and that 2.2 planning should likely be
>> >> >> starting about now.
>> >> >> Any visibility / sharing will be highly appreciated!
>> >> >> thanks in advance,
>> >> >>
>> >> >> Ofir Manor
>> >> >>
>> >> >> Co-Founder & CTO | Equalum
>> >> >>
>> >> >> Mobile: +972-54-7801286 | Email: ofir.manor@equalum.io

