From: Sandor Murakozi
Date: Mon, 9 Apr 2018 14:06:17 +0200
Subject: Re: Clarify window behavior in Spark SQL
To: Reynold Xin
Cc: Li Jin, Herman van Hövell tot Westerflier, Xingbo Jiang, dev@spark.apache.org
Hi Li,
You might find my pending PR useful:
https://github.com/apache/spark/pull/20045/files

It contains a large set of test cases covering the windowing functionality, showing and checking the behavior of a number of special cases.

On Wed, Apr 4, 2018 at 4:26 AM, Reynold Xin <rxin@databricks.com> wrote:
Thanks Li!

On Tue, Apr 3, 2018 at 7:23 PM Li Jin <ice.xelloss@gmail.com> wrote:
Thanks all for the explanation. I am happy to update the API doc.

https://issues.apache.org/jira/browse/SPARK-23861
On Tue, Apr 3, 2018 at 8:54 PM, Reynold Xin <rxin@databricks.com> wrote:
Ah ok. Thanks for commenting. Every day I learn something new about SQL.

For others to follow, SQL Server has a good explanation of the behavior: https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql


Can somebody (Li?) update the API documentation to specify the gotchas, in case users are not familiar with SQL window function semantics?



General Remarks

More than one window function can be used in a single query with a single FROM clause. The OVER clause for each function can differ in partitioning and ordering.

If PARTITION BY is not specified, the function treats all rows of the query result set as a single group.

Important!


If ROWS/RANGE is specified and <window frame preceding> is used for <window frame extent> (short syntax) then this specification is used for the window frame boundary starting point and CURRENT ROW is used for the boundary ending point. For example “ROWS 5 PRECEDING” is equal to “ROWS BETWEEN 5 PRECEDING AND CURRENT ROW”.

Note

If ORDER BY is not specified, the entire partition is used for a window frame. This applies only to functions that do not require an ORDER BY clause. If ROWS/RANGE is not specified but ORDER BY is specified, RANGE UNBOUNDED PRECEDING AND CURRENT ROW is used as the default window frame. This applies only to functions that can accept an optional ROWS/RANGE specification. For example, ranking functions cannot accept ROWS/RANGE, therefore this window frame is not applied even though ORDER BY is present and ROWS/RANGE is not.
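(For readers following along in PySpark, here is a minimal sketch of how that short syntax maps onto the DataFrame API, which always spells out both frame boundaries via rowsBetween; the data and column names are hypothetical, assuming Spark 2.x.)

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import mean

spark = SparkSession.builder.getOrCreate()

# Hypothetical data, just to have something to window over.
df = spark.createDataFrame([(0, float(i)) for i in range(1, 8)], ['id', 'v'])

# "ROWS 5 PRECEDING" corresponds to rowsBetween(-5, Window.currentRow),
# i.e. "ROWS BETWEEN 5 PRECEDING AND CURRENT ROW".
w = Window.partitionBy('id').orderBy('v').rowsBetween(-5, Window.currentRow)

df.withColumn('trailing_mean', mean(df.v).over(w)).show()
```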






On Tue, Apr 3, 2018 at 5:50 PM, Xingbo Jiang <jiangxb1987@gmail.com> wrote:
This is actually by design: without an `ORDER BY` clause, all rows are considered peers of the current row, which means that the frame is effectively the entire partition. This behavior follows the window syntax of PGSQL.
You can refer to the comment by yhuai: https://github.com/apache/spark/pull/5604#discussion_r157931911
:)

2018-04-04 6:27 GMT+08:00 Reynold Xin <rxin@databricks.com>:
Do other (non-Hive) SQL systems do the same thing?

On Tue, Apr 3, 2018 at 3:16 PM, Herman van Hövell tot Westerflier <herman@databricks.com> wrote:

This is something we inherited from Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

It sort of makes sense if you think about it. If there is no ordering there is no way to have a bounded frame. If there is ordering we default to the most commonly used deterministic frame.
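(As a rough PySpark sketch of those two defaults made explicit — an illustration, not something from the thread; it assumes the Window.unboundedPreceding / unboundedFollowing / currentRow constants of recent PySpark and the 'id'/'v' columns from the examples below:)

```python
from pyspark.sql import Window

# Ordered window with no explicit frame: the implicit default is
# RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, written out here.
w_ordered = (Window.partitionBy('id').orderBy('v')
             .rangeBetween(Window.unboundedPreceding, Window.currentRow))

# Unordered window with no explicit frame: the implicit default is the
# whole partition, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
w_unordered = (Window.partitionBy('id')
               .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
```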


On Tue, Apr 3, 2018 at 11:09 PM, Reynold Xin <rxin@databricks.com> wrote:
Seems like a bug.



On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xelloss@gmail.com> wrote:
Hi Devs,

I am seeing some behavior with window functions that is a bit unintuitive and would like to get some clarification.

When using an aggregation function with a window, the frame boundary seems to change depending on the ordering of the window.

Example:
(1)

df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')

w1 = Window.partitionBy('id')

df.withColumn('v2', mean(df.v).over(w1)).show()

+---+---+---+
| id|  v| v2|
+---+---+---+
|  0|  1|2.0|
|  0|  2|2.0|
|  0|  3|2.0|
+---+---+---+


(2)
df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')

w2 = Window.partitionBy('id').orderBy('v')

df.withColumn('v2', mean(df.v).over(w2)).show()

+---+---+---+
| id|  v| v2|
+---+---+---+
|  0|  1|1.0|
|  0|  2|1.5|
|  0|  3|2.0|
+---+---+---+


Seems like orderBy('v') in example (2) also changes the frame boundaries from (unboundedPreceding, unboundedFollowing) to (unboundedPreceding, currentRow).
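(Continuing from the example above, a small sketch — assuming the Window.unboundedPreceding/unboundedFollowing constants of recent PySpark — of how the whole-partition frame can be kept even with an ordered window, which makes the two examples agree again:)

```python
# Pin the frame explicitly so that ordering no longer shrinks it to
# (unboundedPreceding, currentRow).
w3 = (Window.partitionBy('id').orderBy('v')
      .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df.withColumn('v2', mean(df.v).over(w3)).show()
# v2 is 2.0 on every row, as in example (1), despite the orderBy.
```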


I found this behavior a bit unintuitive. I wonder if this behavior is by design and, if so, what the specific rule is for how orderBy() interacts with frame boundaries?



Thanks,

Li








