spark-dev mailing list archives

From Xiao Li <gatorsm...@gmail.com>
Subject Re: Maintenance releases for SPARK-23852?
Date Mon, 16 Apr 2018 22:01:37 GMT
Yes, it sounds good to me. We can upgrade both Parquet 1.8.2 to 1.8.3 and
ORC 1.4.1 to 1.4.3 in our upcoming Spark 2.3.1 release.

Thanks for your efforts! @Henry and @Dongjoon

Xiao

2018-04-16 14:41 GMT-07:00 Henry Robinson <henry@apache.org>:

> Seems like there aren't any objections. I'll pick this thread back up when
> a Parquet maintenance release has happened.
>
> Henry
>
> On 11 April 2018 at 14:00, Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>
>> Great.
>>
>> If we can upgrade the parquet dependency from 1.8.2 to 1.8.3 in Apache
>> Spark 2.3.1, let's upgrade orc dependency from 1.4.1 to 1.4.3 together.
>>
>> Currently, the patch is merged only into the master branch. ORC 1.4.1 has
>> the following issue.
>>
>> https://issues.apache.org/jira/browse/SPARK-23340
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Wed, Apr 11, 2018 at 1:23 PM, Reynold Xin <rxin@databricks.com> wrote:
>>
>>> Seems like this would make sense... we usually make maintenance releases
>>> for bug fixes after a month anyway.
>>>
>>>
>>> On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson <henry@apache.org>
>>> wrote:
>>>
>>>>
>>>>
>>>> On 11 April 2018 at 12:47, Ryan Blue <rblue@netflix.com.invalid> wrote:
>>>>
>>>>> I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of
>>>>> Spark.
>>>>>
>>>>> To be clear though, this only affects Spark when reading data written
>>>>> by Impala, right? Or does Parquet CPP also produce data like this?
>>>>>
>>>>
>>>> I don't know about parquet-cpp, but yeah, the only implementation I've
>>>> seen writing the half-completed stats is Impala. (as you know, that's
>>>> compliant with the spec, just an unusual choice).
>>>>
>>>>
>>>>>
>>>>> On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson <henry@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi all -
>>>>>>
>>>>>> SPARK-23852 (where a query can silently give wrong results thanks to
>>>>>> a predicate pushdown bug in Parquet) is a fairly bad bug. In other
>>>>>> projects I've been involved with, we've released maintenance releases
>>>>>> for bugs of this severity.
>>>>>>
>>>>>> Since Spark 2.4.0 is probably a while away, I wanted to see if there
>>>>>> was any consensus over whether we should consider (at least) a 2.3.1.
>>>>>>
>>>>>> The reason this particular issue is a bit tricky is that the Parquet
>>>>>> community haven't yet produced a maintenance release that fixes the
>>>>>> underlying bug, but they are in the process of releasing a new minor
>>>>>> version, 1.10, which includes a fix. Having spoken to a couple of
>>>>>> Parquet developers, they'd be willing to consider a maintenance
>>>>>> release, but would probably only bother if we (or another affected
>>>>>> project) asked them to.
>>>>>>
>>>>>> My guess is that we wouldn't want to upgrade to a new minor version
>>>>>> of Parquet for a Spark maintenance release, so asking for a Parquet
>>>>>> maintenance release makes sense.
>>>>>>
>>>>>> What does everyone think?
>>>>>>
>>>>>> Best,
>>>>>> Henry
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>>
>>>
>>
>
