Delivered-To: mailing list dev@spark.apache.org
From: Ryan Blue
Date: Tue, 1 Nov 2016 13:22:00 -0700
Subject: Re: Updating Parquet dep to 1.9
To: Reynold Xin
Cc: Ryan Blue, Sean Owen, Michael Allman, Spark Dev List

I can when I'm finished with a couple other issues if no one gets to it first.

Michael, if you're interested in updating to 1.9.0 I'm happy to help review that PR.

On Tue, Nov 1, 2016 at 1:03 PM, Reynold Xin wrote:

> Ryan want to submit a pull request?
>
> On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue wrote:
>
>> 1.9.0 includes some fixes intended specifically for Spark:
>>
>> * PARQUET-389: Evaluates push-down predicates for missing columns as
>> though they are null. This is to address Spark's work-around that requires
>> reading and merging file schemas, even for metastore tables.
>> * PARQUET-654: Adds an option to disable record-level predicate
>> push-down, but keep row group evaluation. This allows Spark to skip row
>> groups based on stats and dictionaries, but implement its own vectorized
>> record filtering.
>>
>> The Parquet community also evaluated performance to ensure no performance
>> regressions from moving to the ByteBuffer read path.
>>
>> There is one concern about 1.9.0 that will be addressed in 1.9.1, which
>> is that stats calculations were incorrectly using unsigned byte order for
>> string comparison. This means that min/max stats can't be used if the data
>> contains (or may contain) UTF8 characters with the msb set. 1.9.0 won't
>> return the bad min/max values for correctness, but there is a property to
>> override this behavior for data that doesn't use the affected code points.
>>
>> Upgrading to 1.9.0 depends on how the community wants to handle the sort
>> order bug: whether correctness or performance should be the default.
>>
>> rb
>>
>> On Tue, Nov 1, 2016 at 2:22 AM, Sean Owen wrote:
>>
>>> Yes this came up from a different direction:
>>> https://issues.apache.org/jira/browse/SPARK-18140
>>>
>>> I think it's fine to pursue an upgrade to fix these several issues. The
>>> question is just how well it will play with other components, so bears some
>>> testing and evaluation of the changes from 1.8, but yes this would be good.
>>>
>>> On Mon, Oct 31, 2016 at 9:07 PM Michael Allman wrote:
>>>
>>>> Hi All,
>>>>
>>>> Is anyone working on updating Spark's Parquet library dep to 1.9? If
>>>> not, I can at least get started on it and publish a PR.
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>

--
Ryan Blue
Software Engineer
Netflix
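[Editor's note, not part of the original thread] The PARQUET-389 behavior Ryan describes — evaluating push-down predicates on a column missing from a file's schema as though every value were null — can be sketched with a toy filter. This is an illustrative assumption about the semantics, not parquet-mr code; the function name and signature are invented for the example.

```python
def row_group_may_match(column, op, value, present_columns, stats):
    """Toy row-group filter illustrating the PARQUET-389 semantics.

    Returns True if the row group may contain rows matching the
    predicate `column op value`. A column missing from this file's
    schema is treated as entirely null, so value predicates on it can
    never match and the row group can be skipped without Spark's
    schema-merging work-around.
    """
    if column not in present_columns:
        # Missing column => all values null: only null-checks can match.
        return op == "is_null"
    if op == "is_null":
        return True  # toy model: assume nulls are always possible
    lo, hi = stats[column]  # min/max statistics for the column
    if op == "eq":
        return lo <= value <= hi
    raise ValueError(f"unsupported op: {op}")

# Predicate on a column added to the table after this file was written:
# the whole row group is eliminated without reading any data pages.
skip = row_group_may_match("new_col", "eq", 5, {"old_col"}, {"old_col": (1, 10)})
```

In the sketch, `skip` comes back False for the missing column, while an `is_null` predicate on the same column would still keep the row group.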
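[Editor's note, not part of the original thread] The sort-order concern in the thread comes down to signed versus unsigned byte-wise comparison: UTF-8 encodes non-ASCII code points with the high bit set, so the two orderings disagree exactly when the data contains such characters — which is when the 1.9.0 min/max string stats can't be trusted. The helper below is a hypothetical illustration, not Parquet's implementation.

```python
def cmp_bytes(a: bytes, b: bytes, signed: bool) -> int:
    """Lexicographically compare two byte strings, treating each byte
    as signed or unsigned. Returns -1, 0, or 1."""
    conv = (lambda v: v - 256 if v >= 128 else v) if signed else (lambda v: v)
    for x, y in zip(a, b):
        x, y = conv(x), conv(y)
        if x != y:
            return -1 if x < y else 1
    return (len(a) > len(b)) - (len(a) < len(b))

ascii_val = "z".encode("utf-8")   # 0x7A, high bit clear
accented = "é".encode("utf-8")    # 0xC3 0xA9, high bit set

# Unsigned order: 0x7A (122) < 0xC3 (195), so 'z' sorts before 'é'.
# Signed order: 0x7A (122) > 0xC3 (-61), so 'z' sorts after 'é'.
unsigned_order = cmp_bytes(ascii_val, accented, signed=False)
signed_order = cmp_bytes(ascii_val, accented, signed=True)
```

Because the two orderings disagree, a min/max pair computed under one ordering can silently exclude matching rows when a reader filters under the other — hence 1.9.0's choice to withhold the bad stats for correctness.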