Hi Hyukjin,

Thanks for bringing this up. Could you please make a PR for this one? We didn't use PARQUET_2_0 mostly because it's less mature than PARQUET_1_0, but we should let users choose the writer version, as long as PARQUET_1_0 remains the default option.

Cheng

On 10/8/15 11:04 PM, Hyukjin Kwon wrote:
Hi all,

While writing some Parquet files with Spark, I found that it always writes the files with writer version 1.

This affects the encoding types used in the file.
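For example, one way to see the difference is to inspect the column-chunk encodings recorded in the footer (a rough sketch using parquet-hadoop; the part-file path is only a placeholder):

import scala.collection.JavaConverters._
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

// Read the footer of one part file produced by Spark (placeholder path).
val footer = ParquetFileReader.readFooter(
  sc.hadoopConfiguration, new Path("/tmp/test.parquet/part-r-00000.gz.parquet"))

// Writer version 1 typically produces PLAIN/RLE/PLAIN_DICTIONARY encodings,
// while writer version 2 can also produce e.g. DELTA_BINARY_PACKED or RLE_DICTIONARY.
footer.getBlocks.asScala.foreach { block =>
  block.getColumns.asScala.foreach { column =>
    println(s"${column.getPath}: ${column.getEncodings}")
  }
}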

Is this fixed intentionally for some reason?


I changed the code, tested writing with writer version 2, and it looks fine.

In more detail, I found that the writer version is hard-coded in org.apache.spark.sql.execution.datasources.parquet.CatalystWriteSupport.scala:

def setSchema(schema: StructType, configuration: Configuration): Unit = {
  schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
  configuration.set(SPARK_ROW_SCHEMA, schema.json)
  configuration.set(
    ParquetOutputFormat.WRITER_VERSION,
    ParquetProperties.WriterVersion.PARQUET_1_0.toString)
}

I changed it to the following so that the value from the given configuration is kept:

def setSchema(schema: StructType, configuration: Configuration): Unit = {
  schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
  configuration.set(SPARK_ROW_SCHEMA, schema.json)
  configuration.set(
    ParquetOutputFormat.WRITER_VERSION,
    configuration.get(ParquetOutputFormat.WRITER_VERSION,
      ParquetProperties.WriterVersion.PARQUET_1_0.toString)
  )
}

and then set the version to version 2:
sc.hadoopConfiguration.set(ParquetOutputFormat.WRITER_VERSION,
    ParquetProperties.WriterVersion.PARQUET_2_0.toString)
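
For completeness, a minimal end-to-end sketch of the kind of run that can be used to check this (the example DataFrame and output path are only illustrative):

import org.apache.parquet.column.ParquetProperties
import org.apache.parquet.hadoop.ParquetOutputFormat

// Ask for writer version 2 before writing; with the change above,
// the setting is no longer overwritten by setSchema.
sc.hadoopConfiguration.set(ParquetOutputFormat.WRITER_VERSION,
  ParquetProperties.WriterVersion.PARQUET_2_0.toString)

// Write a small example DataFrame and read it back.
val df = sqlContext.range(0, 1000)
df.write.parquet("/tmp/parquet-v2-test")
sqlContext.read.parquet("/tmp/parquet-v2-test").count()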