Mailing-List: contact dev-help@spark.apache.org; run by ezmlm
Precedence: bulk
Delivered-To: mailing list dev@spark.apache.org
MIME-Version: 1.0
References: <0072CAD8-3BAD-4AE9-91B9-5A6C18AED293@gmail.com>
In-Reply-To: <0072CAD8-3BAD-4AE9-91B9-5A6C18AED293@gmail.com>
From: Amit Sela
Date: Thu, 20 Oct 2016 10:30:17 +0000
Subject: Re: StructuredStreaming status
To: Matei Zaharia
Cc: dev
Content-Type: multipart/alternative; boundary=e89a8ff250b4181f4d053f496ab3
archived-at: Thu, 20 Oct 2016 10:35:42 -0000

--e89a8ff250b4181f4d053f496ab3
Content-Type: text/plain; charset=UTF-8

On Thu, Oct 20, 2016 at 7:40 AM Matei Zaharia wrote:

> Yeah, as Shivaram pointed out, there have been research projects that
> looked at it. Also, Structured Streaming was explicitly designed to not
> make microbatching part of the API or part of the output behavior (tying
> triggers to it).
>
But Streaming Query sources are still designed with microbatches in mind.
Can this be removed, leaving offset tracking to the executors?

> However, when people begin working on that is a function of demand
> relative to other features. I don't think we can commit to one plan before
> exploring more options, but basically there is Shivaram's project, which
> adds a few new concepts to the scheduler, and there's the option to reduce
> control plane latency in the current system, which hasn't been heavily
> optimized yet but should be doable (lots of systems can handle 10,000s of
> RPCs per second).
>
> Matei
>
> On Oct 19, 2016, at 9:20 PM, Cody Koeninger wrote:
>
> I don't think it's just about what to target - if you could target 1ms
> batches, without harming 1 second or 1 minute batches.... why wouldn't you?
> I think it's about having a clear strategy and dedicating resources to it.
> If scheduling batches at an order of magnitude or two lower latency is the
> strategy, and that's actually feasible, that's great. But I haven't seen
> that clear direction, and this is by no means a recent issue.
>
> On Oct 19, 2016 7:36 PM, "Matei Zaharia" wrote:
>
> I'm also curious whether there are concerns other than latency with the
> way stuff executes in Structured Streaming (now that the time steps don't
> have to act as triggers), as well as what latency people want for various
> apps.
>
> The stateful operator designs for streaming systems aren't inherently
> "better" than micro-batching -- they lose a lot of stuff that is possible
> in Spark, such as load balancing work dynamically across nodes, speculative
> execution for stragglers, scaling clusters up and down elastically, etc.
> Moreover, Spark itself could execute the current model with much lower
> latency. The question is just what combinations of latency, throughput,
> fault recovery, etc. to target.
>
> Matei
>
> On Oct 19, 2016, at 2:18 PM, Amit Sela wrote:
>
> On Thu, Oct 20, 2016 at 12:07 AM Shivaram Venkataraman
> <shivaram@eecs.berkeley.edu> wrote:
>
> At the AMPLab we've been working on a research project that looks at
> just the scheduling latencies and on techniques to get lower
> scheduling latency. It moves away from the micro-batch model, but
> reuses the fault tolerance etc. in Spark. However we haven't yet
> figured out all the parts in integrating this with the rest of
> structured streaming. I'll try to post a design doc / SIP about this
> soon.
>
> On a related note - are there other problems users face with
> micro-batch other than latency?
>
> I think that the fact that they serve as an output trigger is a problem,
> but Structured Streaming seems to resolve this now.
>
> Thanks
> Shivaram
>
> On Wed, Oct 19, 2016 at 1:29 PM, Michael Armbrust wrote:
> > I know people are seriously thinking about latency. So far that has not
> > been the limiting factor in the users I've been working with.
> >
> > On Wed, Oct 19, 2016 at 1:11 PM, Cody Koeninger wrote:
> >>
> >> Is anyone seriously thinking about alternatives to microbatches?
> >>
> >> On Wed, Oct 19, 2016 at 2:45 PM, Michael Armbrust wrote:
> >> > Anything that is actively being designed should be in JIRA, and it
> >> > seems like you found most of it. In general, release windows can be
> >> > found on the wiki.
> >> >
> >> > 2.1 has a lot of stability fixes as well as the Kafka support you
> >> > mentioned.
> >> > It may also include some of the following.
> >> >
> >> > The items I'd like to start thinking about next are:
> >> >  - Evicting state from the store based on event time watermarks
> >> >  - Sessionization (grouping together related events by key / eventTime)
> >> >  - Improvements to the query planner (remove some of the restrictions
> >> >    on what queries can be run).
> >> >
> >> > This is roughly in order based on what I've been hearing users hit the
> >> > most.
> >> > Would love more feedback on what is blocking real use cases.
> >> >
> >> > On Tue, Oct 18, 2016 at 1:51 AM, Ofir Manor wrote:
> >> >>
> >> >> Hi,
> >> >> I hope it is the right forum.
> >> >> I am looking for some information on what to expect from
> >> >> StructuredStreaming in its next releases to help me choose when /
> >> >> where to start using it more seriously (or where to invest in
> >> >> workarounds and where to wait). I couldn't find a good place where
> >> >> such planning is discussed for 2.1 (like, for example ML and
> >> >> SPARK-15581).
> >> >> I'm aware of the 2.0 documented limits
> >> >> (http://spark.apache.org/docs/2.0.1/structured-streaming-programming-guide.html#unsupported-operations),
> >> >> like no support for multiple aggregation levels, joins are strictly
> >> >> to a static dataset (no SCD or stream-stream) etc, limited sources /
> >> >> sinks (like no sink for interactive queries) etc etc
> >> >> I'm also aware of some changes that have landed in master, like the
> >> >> new Kafka 0.10 source (and its on-going improvements) in SPARK-15406,
> >> >> the metrics in SPARK-17731, and some improvements for the file source.
> >> >> If I remember correctly, the discussion on Spark release cadence
> >> >> concluded with a preference for four-month cycles, with likely code
> >> >> freeze pretty soon (end of October). So I believe the scope for 2.1
> >> >> should likely be quite clear to some, and that 2.2 planning should
> >> >> likely be starting about now.
> >> >> Any visibility / sharing will be highly appreciated!
> >> >> thanks in advance,
> >> >>
> >> >> Ofir Manor
> >> >>
> >> >> Co-Founder & CTO | Equalum
> >> >>
> >> >> Mobile: +972-54-7801286 <054-780-1286> | Email: ofir.manor@equalum.io
> >> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>

--e89a8ff250b4181f4d053f496ab3
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable


--e89a8ff250b4181f4d053f496ab3--