spark-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: Updating Parquet dep to 1.9
Date Tue, 01 Nov 2016 20:22:00 GMT
I can when I'm finished with a couple other issues if no one gets to it
first.

Michael, if you're interested in updating to 1.9.0 I'm happy to help review
that PR.

On Tue, Nov 1, 2016 at 1:03 PM, Reynold Xin <rxin@databricks.com> wrote:

> Ryan, want to submit a pull request?
>
>
> On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue <rblue@netflix.com.invalid>
> wrote:
>
>> 1.9.0 includes some fixes intended specifically for Spark:
>>
>> * PARQUET-389: Evaluates push-down predicates for missing columns as
>> though they are null. This is to address Spark's work-around that requires
>> reading and merging file schemas, even for metastore tables.
>> * PARQUET-654: Adds an option to disable record-level predicate
>> push-down, but keep row group evaluation. This allows Spark to skip row
>> groups based on stats and dictionaries, but implement its own vectorized
>> record filtering.
>>
>> The Parquet community also evaluated performance to ensure no performance
>> regressions from moving to the ByteBuffer read path.
>>
>> There is one concern about 1.9.0 that will be addressed in 1.9.1, which
>> is that stats calculations were incorrectly using signed byte order for
>> string comparison. This means that min/max stats can't be used if the data
>> contains (or may contain) UTF8 characters with the msb set. For correctness,
>> 1.9.0 won't return the bad min/max values, but there is a property to
>> override this behavior for data that doesn't use the affected code points.
>>
>> Upgrading to 1.9.0 depends on how the community wants to handle the sort
>> order bug: whether correctness or performance should be the default.
>>
>> rb
>>
>> On Tue, Nov 1, 2016 at 2:22 AM, Sean Owen <sowen@cloudera.com> wrote:
>>
>>> Yes this came up from a different direction:
>>> https://issues.apache.org/jira/browse/SPARK-18140
>>>
>>> I think it's fine to pursue an upgrade to fix these several issues. The
>>> question is just how well it will play with other components, so it bears
>>> some testing and evaluation of the changes from 1.8, but yes, this would
>>> be good.
>>>
>>> On Mon, Oct 31, 2016 at 9:07 PM Michael Allman <michael@videoamp.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Is anyone working on updating Spark's Parquet library dep to 1.9? If
>>>> not, I can at least get started on it and publish a PR.
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
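
For anyone following along, the sort-order concern quoted above can be
illustrated with a small sketch. It is plain Python, not Parquet code, and the
helper names (signed_key, unsigned_key) are made up for the example. It
assumes that min/max were computed by comparing UTF-8 bytes as signed values
(as Java's byte type does), while readers expect the usual unsigned byte
order.

```python
from typing import List


def signed_key(s: str) -> List[int]:
    """Sort key that treats each UTF-8 byte as a signed 8-bit value."""
    return [b - 256 if b >= 128 else b for b in s.encode("utf-8")]


def unsigned_key(s: str) -> bytes:
    """Sort key that treats each UTF-8 byte as unsigned (bytes compare this way)."""
    return s.encode("utf-8")


values = ["apple", "zebra", "éclair"]  # "é" encodes to bytes with the msb set

# Stats as a writer using signed comparison would record them.
signed_min = min(values, key=signed_key)
signed_max = max(values, key=signed_key)

# Stats under the unsigned order that readers expect.
unsigned_min = min(values, key=unsigned_key)
unsigned_max = max(values, key=unsigned_key)

# A reader evaluating `col > "zebra"` against the signed max would conclude
# that no row can match, even though "éclair" sorts after "zebra".
wrongly_skipped = not (unsigned_key(signed_max) > unsigned_key("zebra"))
actually_matches = any(unsigned_key(v) > unsigned_key("zebra") for v in values)

# When every byte is below 0x80 the two orders agree, which is the case the
# override property is meant for.
ascii_values = ["apple", "zebra", "mango"]
orders_agree = sorted(ascii_values, key=signed_key) == sorted(ascii_values, key=unsigned_key)
```

Here the signed stats come out as min "éclair" and max "zebra", while the
unsigned stats are min "apple" and max "éclair", so trusting the signed values
would drop a row group that contains a matching row.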

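The two Spark-oriented fixes quoted above (PARQUET-389 and PARQUET-654) can
also be sketched roughly. This is an illustration only and does not use the
Parquet API: the stats dictionaries, the might_match_gt helper, and the
assumption that every column present in a file has a (min, max) entry are all
hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical per-row-group stats: column name -> (min, max).
RowGroupStats = Dict[str, Tuple[int, int]]


def might_match_gt(stats: RowGroupStats, column: str, bound: int) -> bool:
    """Return False only if no row in the group can satisfy `column > bound`."""
    if column not in stats:
        # Column is missing from this file: evaluate it as though every value
        # is null. `null > bound` is never true, so the group can be skipped
        # without reading and merging file schemas first.
        return False
    _, col_max = stats[column]
    return col_max > bound


row_groups: List[RowGroupStats] = [
    {"id": (0, 99)},
    {"id": (100, 199)},
    {"other": (0, 5)},  # older file written before `id` existed
]

# Row-group evaluation only: decide which groups to read at all.
kept = [i for i, rg in enumerate(row_groups) if might_match_gt(rg, "id", 150)]
```

Only the second row group survives. It still holds ids from 100 to 150 that do
not match, and with record-level push-down disabled those rows are left for
the engine's own (for Spark, vectorized) filter to discard.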

-- 
Ryan Blue
Software Engineer
Netflix
