From: Xiao Li
Date: Mon, 16 Apr 2018 15:01:37 -0700
Subject: Re: Maintenance releases for SPARK-23852?
To: Henry Robinson
Cc: Dongjoon Hyun, Reynold Xin, Ryan Blue, Spark dev list

Yes, it sounds good to me. We can upgrade both Parquet 1.8.2 to 1.8.3 and ORC 1.4.1 to 1.4.3 in our upcoming Spark 2.3.1 release.

Thanks for your efforts! @Henry and @Dongjoon

Xiao

2018-04-16 14:41 GMT-07:00 Henry Robinson <henry@apache.org>:

> Seems like there aren't any objections. I'll pick this thread back up when
> a Parquet maintenance release has happened.
>
> Henry
>
> On 11 April 2018 at 14:00, Dongjoon Hyun <dongjoon.hyun@gmail.com> wrote:
>
>> Great.
>>
>> If we can upgrade the parquet dependency from 1.8.2 to 1.8.3 in Apache
>> Spark 2.3.1, let's upgrade orc dependency from 1.4.1 to 1.4.3 together.
>>
>> Currently, the patch is only merged into master branch now. 1.4.1 has the
>> following issue.
>>
>> https://issues.apache.org/jira/browse/SPARK-23340
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Apr 11, 2018 at 1:23 PM, Reynold Xin <rxin@databricks.com> wrote:
>>
>>> Seems like this would make sense... we usually make maintenance releases
>>> for bug fixes after a month anyway.
>>>
>>> On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson <henry@apache.org>
>>> wrote:
>>>
>>>> On 11 April 2018 at 12:47, Ryan Blue <rblue@netflix.com.invalid> wrote:
>>>>
>>>>> I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of
>>>>> Spark.
>>>>>
>>>>> To be clear though, this only affects Spark when reading data written
>>>>> by Impala, right? Or does Parquet CPP also produce data like this?
>>>>
>>>> I don't know about parquet-cpp, but yeah, the only implementation I've
>>>> seen writing the half-completed stats is Impala. (as you know, that's
>>>> compliant with the spec, just an unusual choice).
>>>>
>>>>> On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson <henry@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi all -
>>>>>>
>>>>>> SPARK-23852 (where a query can silently give wrong results thanks to
>>>>>> a predicate pushdown bug in Parquet) is a fairly bad bug. In other projects
>>>>>> I've been involved with, we've released maintenance releases for bugs of
>>>>>> this severity.
>>>>>>
>>>>>> Since Spark 2.4.0 is probably a while away, I wanted to see if there
>>>>>> was any consensus over whether we should consider (at least) a 2.3.1.
>>>>>>
>>>>>> The reason this particular issue is a bit tricky is that the Parquet
>>>>>> community haven't yet produced a maintenance release that fixes the
>>>>>> underlying bug, but they are in the process of releasing a new minor
>>>>>> version, 1.10, which includes a fix. Having spoken to a couple of Parquet
>>>>>> developers, they'd be willing to consider a maintenance release, but would
>>>>>> probably only bother if we (or another affected project) asked them to.
>>>>>>
>>>>>> My guess is that we wouldn't want to upgrade to a new minor version
>>>>>> of Parquet for a Spark maintenance release, so asking for a Parquet
>>>>>> maintenance release makes sense.
>>>>>>
>>>>>> What does everyone think?
>>>>>>
>>>>>> Best,
>>>>>> Henry
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix