spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maciej Szymkiewicz <mszymkiew...@gmail.com>
Subject Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window
Date Wed, 18 Jan 2017 17:52:19 GMT
Thanks for the response Burak,

As any sane person I try to steer away from the objects which have both
calendar and unsafe in their fully qualified names but if there is no
bigger picture I missed here I would go with 1 as well. And of course
fix the error message. I understand this has been introduced with
structured streaming in mind, but it is an useful feature in general,
not only in high precision scale. To be honest I would love to see some
generalized version which could be used (I mean without hacking) with
arbitrary numeric sequence. It could address at least some scenarios in
which people try to use window functions without PARTITION BY clause and
fail miserably.

Regarding ambiguity... Sticking with days doesn't really resolve the
problem, does it? If one were to nitpick it doesn't look like this
implementation even touches all the subtleties of DST or leap second.



On 01/18/2017 05:52 PM, Burak Yavuz wrote:
> Hi Maciej,
>
> I believe it would be useful to either fix the documentation or fix
> the implementation. I'll leave it to the community to comment on. The
> code right now disallows intervals provided in months and years,
> because they are not a "consistently" fixed amount of time. A month
> can be 28, 29, 30, or 31 days. A year is 12 months for sure, but is it
> 360 days (sometimes used in finance), 365 days or 366 days? 
>
> Therefore we could either:
>   1) Allow windowing when intervals are given in days and less, even
> though it could be 365 days, and fix the documentation.
>   2) Explicitly disallow it as there may be a lot of data for a given
> window, but partial aggregations should help with that.
>
> My thoughts are to go with 1. What do you think?
>
> Best,
> Burak
>
> On Wed, Jan 18, 2017 at 10:18 AM, Maciej Szymkiewicz
> <mszymkiewicz@gmail.com <mailto:mszymkiewicz@gmail.com>> wrote:
>
>     Hi,
>
>     Can I ask for some clarifications regarding intended behavior of
>     window / TimeWindow?
>
>     PySpark documentation states that "Windows in the order of months
>     are not supported". This is further confirmed by the checks in
>     TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l).
>
>     Surprisingly enough we can pass interval much larger than a month
>     by expressing interval in days or another unit of a higher
>     precision. So this fails:
>
>     Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))
>
>     while following is accepted:
>
>     Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))
>
>     with results which look sensible at first glance.
>
>     Is it a matter of a faulty validation logic (months will be
>     assigned only if there is a match against years or months
>     https://git.io/vMPdi) or expected behavior and I simply
>     misunderstood the intentions?
>
>     -- 
>     Best,
>     Maciej
>
>


Mime
View raw message