spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Li Jin <ice.xell...@gmail.com>
Subject Re: Clarify window behavior in Spark SQL
Date Tue, 03 Apr 2018 22:08:16 GMT
Here is the original code and comments:
https://github.com/apache/spark/commit/b6b50efc854f298d5b3e11c05dca995a85bec962#diff-4a8f00ca33a80744965463dcc6662c75L277

Seems this is intentional. Although I am not really sure why - maybe to
match other SQL systems behavior?

On Tue, Apr 3, 2018 at 5:09 PM, Reynold Xin <rxin@databricks.com> wrote:

> Seems like a bug.
>
>
>
> On Tue, Apr 3, 2018 at 1:26 PM, Li Jin <ice.xelloss@gmail.com> wrote:
>
>> Hi Devs,
>>
>> I am seeing some behavior with window functions that is a bit unintuitive
>> and would like to get some clarification.
>>
>> When using aggregation function with window, the frame boundary seems to
>> change depending on the order of the window.
>>
>> Example:
>> (1)
>>
>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>
>> w1 = Window.partitionBy('id')
>>
>> df.withColumn('v2', mean(df.v).over(w1)).show()
>>
>> +---+---+---+
>>
>> | id|  v| v2|
>>
>> +---+---+---+
>>
>> |  0|  1|2.0|
>>
>> |  0|  2|2.0|
>>
>> |  0|  3|2.0|
>>
>> +---+---+---+
>>
>> (2)
>> df = spark.createDataFrame([[0, 1], [0, 2], [0, 3]]).toDF('id', 'v')
>>
>> w2 = Window.partitionBy('id').orderBy('v')
>>
>> df.withColumn('v2', mean(df.v).over(w2)).show()
>>
>> +---+---+---+
>>
>> | id|  v| v2|
>>
>> +---+---+---+
>>
>> |  0|  1|1.0|
>>
>> |  0|  2|1.5|
>>
>> |  0|  3|2.0|
>>
>> +---+---+---+
>>
>> Seems like orderBy('v') in the example (2) also changes the frame
>> boundaries from (
>>
>> unboundedPreceding, unboundedFollowing) to (unboundedPreceding,
>> currentRow).
>>
>>
>> I found this behavior a bit unintuitive. I wonder if this behavior is by
>> design and if so, what's the specific rule that orderBy() interacts with
>> frame boundaries?
>>
>>
>> Thanks,
>>
>> Li
>>
>>
>

Mime
View raw message