spark-user mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: Fixed writer version as version1 for Parquet as wring a Parquet file.
Date Fri, 09 Oct 2015 18:02:50 GMT
Hi Hyukjin,

Thanks for bringing this up. Could you please make a PR for this one? We 
didn't use PARQUET_2_0 mostly because it's less mature than PARQUET_1_0, 
but we should let users choose the writer version, as long as 
PARQUET_1_0 remains the default option.

Cheng

On 10/8/15 11:04 PM, Hyukjin Kwon wrote:
> Hi all,
>
> While writing some Parquet files with Spark, I found that it only ever
> writes them with writer version 1.
>
> This affects the encoding types used in the file.
>
> Is the version fixed intentionally for some reason?
>
>
> I changed the code and tested writing with writer version 2, and it
> looks fine.
>
> In more detail, I found that the writer version is hard-coded in
> org.apache.spark.sql.execution.datasources.parquet.CatalystWriteSupport.scala
>
>     def setSchema(schema: StructType, configuration: Configuration): Unit = {
>       schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
>       configuration.set(SPARK_ROW_SCHEMA, schema.json)
>       configuration.set(
>         ParquetOutputFormat.WRITER_VERSION,
>         ParquetProperties.WriterVersion.PARQUET_1_0.toString)
>     }
>
> I changed it to the following, so that a writer version already set in
> the given configuration is kept:
>
>     def setSchema(schema: StructType, configuration: Configuration): Unit = {
>       schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
>       configuration.set(SPARK_ROW_SCHEMA, schema.json)
>       configuration.set(
>         ParquetOutputFormat.WRITER_VERSION,
>         configuration.get(
>           ParquetOutputFormat.WRITER_VERSION,
>           ParquetProperties.WriterVersion.PARQUET_1_0.toString))
>     }
>
> and then set the writer version to version 2:
>
>     sc.hadoopConfiguration.set(
>       ParquetOutputFormat.WRITER_VERSION,
>       ParquetProperties.WriterVersion.PARQUET_2_0.toString)
>
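
The change proposed in the quoted message amounts to reading the writer version with a default before writing it back, so a value set by the user survives and PARQUET_1_0 stays the fallback. A minimal sketch of that pattern follows. It assumes a hypothetical in-memory stand-in (`FakeConf`) instead of Hadoop's `Configuration`, and the key and version strings are illustrative assumptions rather than values taken from the Parquet sources.

```scala
import scala.collection.mutable

// Hypothetical stand-in for org.apache.hadoop.conf.Configuration;
// only the two methods used in the thread are modelled.
final class FakeConf {
  private val props = mutable.Map.empty[String, String]
  def set(key: String, value: String): Unit = props.update(key, value)
  def get(key: String, default: String): String = props.getOrElse(key, default)
}

object WriterVersionSketch {
  // Assumed values for illustration; the real code refers to
  // ParquetOutputFormat.WRITER_VERSION and ParquetProperties.WriterVersion.
  val WriterVersionKey = "parquet.writer.version"
  val V1 = "PARQUET_1_0"
  val V2 = "PARQUET_2_0"

  // Keep a writer version the caller already set; otherwise fall back to V1.
  // Returns the version that ends up in the configuration.
  def applyWriterVersion(conf: FakeConf): String = {
    val chosen = conf.get(WriterVersionKey, V1)
    conf.set(WriterVersionKey, chosen)
    chosen
  }
}
```

With an untouched configuration, `applyWriterVersion` leaves version 1 in place. If the caller sets `WriterVersionKey` to `V2` first, that choice is preserved, which is the behaviour the patch is after.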

