spark-dev mailing list archives

From Michael Allman <mich...@videoamp.com>
Subject Re: Updating Parquet dep to 1.9
Date Wed, 02 Nov 2016 15:31:31 GMT
Sounds great. Regarding the min/max stats issue, is that an issue with the way the files are
written or read? What's the Parquet project issue for that bug? What does the 1.9.1 release
timeline look like?

I will aim to have a PR in by the end of the week. I feel strongly that either this or
https://github.com/apache/spark/pull/15538 needs to make it into 2.1. The logging output
issue is really bad. I would probably call it a blocker.

Michael


> On Nov 1, 2016, at 1:22 PM, Ryan Blue <rblue@netflix.com> wrote:
> 
> I can when I'm finished with a couple other issues if no one gets to it first.
> 
> Michael, if you're interested in updating to 1.9.0 I'm happy to help review that PR.
> 
> On Tue, Nov 1, 2016 at 1:03 PM, Reynold Xin <rxin@databricks.com> wrote:
> Ryan want to submit a pull request?
> 
> 
> On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue <rblue@netflix.com.invalid> wrote:
> 1.9.0 includes some fixes intended specifically for Spark:
> 
> * PARQUET-389: Evaluates push-down predicates for missing columns as though they are
> null. This is to address Spark's work-around that requires reading and merging file schemas,
> even for metastore tables.
> * PARQUET-654: Adds an option to disable record-level predicate push-down, but keep row
> group evaluation. This allows Spark to skip row groups based on stats and dictionaries, but
> implement its own vectorized record filtering (see the sketch after this list).
> 
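> As a rough illustration of how Spark could wire this up against 1.9.0 (an untested sketch;
> the ParquetInputFormat constant names are from memory, so verify them before relying on
> this, and "event_version" is just a made-up column name):
> 
>   import org.apache.hadoop.conf.Configuration
>   import org.apache.parquet.filter2.predicate.FilterApi
>   import org.apache.parquet.hadoop.ParquetInputFormat
> 
>   val conf = new Configuration()
> 
>   // Predicate on a column that may be absent from older files; with PARQUET-389 the
>   // reader evaluates it as if the column were null instead of requiring merged schemas.
>   val pred = FilterApi.gt(FilterApi.intColumn("event_version"), Integer.valueOf(2))
>   ParquetInputFormat.setFilterPredicate(conf, pred)
> 
>   // With PARQUET-654, keep row-group pruning via stats and dictionaries but leave
>   // record-level filtering to Spark's own vectorized reader.
>   conf.setBoolean(ParquetInputFormat.RECORD_FILTERING_ENABLED, false)
>   conf.setBoolean(ParquetInputFormat.STATS_FILTERING_ENABLED, true)
>   conf.setBoolean(ParquetInputFormat.DICTIONARY_FILTERING_ENABLED, true)
> 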
> The Parquet community also evaluated performance to ensure no performance regressions
> from moving to the ByteBuffer read path.
> 
> There is one concern about 1.9.0 that will be addressed in 1.9.1, which is that stats
> calculations were incorrectly using unsigned byte order for string comparison. This means
> that min/max stats can't be used if the data contains (or may contain) UTF8 characters with
> the msb set. For correctness, 1.9.0 won't return the bad min/max values, but there is a
> property to override this behavior for data that doesn't use the affected code points.
> 
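> To make the msb point concrete, here is a tiny standalone Scala sketch (not parquet-mr
> code) showing how signed and unsigned byte-wise comparison diverge for UTF-8 data:
> 
>   // Compare two byte arrays lexicographically, either as signed bytes (Java's default
>   // byte semantics) or as unsigned bytes.
>   def compareBytes(x: Array[Byte], y: Array[Byte], signed: Boolean): Int = {
>     val len = math.min(x.length, y.length)
>     var i = 0
>     while (i < len) {
>       val a = if (signed) x(i).toInt else x(i) & 0xff
>       val b = if (signed) y(i).toInt else y(i) & 0xff
>       if (a != b) return a - b
>       i += 1
>     }
>     x.length - y.length
>   }
> 
>   val ascii    = "a".getBytes("UTF-8")  // 0x61
>   val accented = "é".getBytes("UTF-8")  // 0xC3 0xA9 -- both bytes have the msb set
> 
>   // Signed:   0xC3 reads as -61, so "é" sorts before "a".
>   // Unsigned: 0xC3 reads as 195, so "é" sorts after "a".
>   println(compareBytes(accented, ascii, signed = true))   // negative
>   println(compareBytes(accented, ascii, signed = false))  // positive
>   // Min/max written under one ordering are simply wrong for a reader assuming the other,
>   // which is why the stats can't be trusted once such characters may appear.
> 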
> Upgrading to 1.9.0 depends on how the community wants to handle the sort order bug: whether
> correctness or performance should be the default.
> 
> rb
> 
> On Tue, Nov 1, 2016 at 2:22 AM, Sean Owen <sowen@cloudera.com> wrote:
> Yes this came up from a different direction: https://issues.apache.org/jira/browse/SPARK-18140
> 
> I think it's fine to pursue an upgrade to fix these several issues. The question is just
> how well it will play with other components, so it bears some testing and evaluation of the
> changes from 1.8, but yes this would be good.
> 
> On Mon, Oct 31, 2016 at 9:07 PM Michael Allman <michael@videoamp.com> wrote:
> Hi All,
> 
> Is anyone working on updating Spark's Parquet library dep to 1.9? If not, I can at least
> get started on it and publish a PR.
> 
> Cheers,
> 
> Michael
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix

