spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pei-Lun Lee <pl...@appier.com>
Subject SparkSQL 1.3.0 (RC3) failed to read parquet file generated by 1.1.1
Date Tue, 10 Mar 2015 09:06:16 GMT
Hi,

I found that if I try to read parquet file generated by spark 1.1.1 using
1.3.0-rc3 by default settings, I got this error:

com.fasterxml.jackson.core.JsonParseException: Unrecognized token
'StructType': was expecting ('true', 'false' or 'null')
 at [Source: StructType(List(StructField(a,IntegerType,false))); line: 1,
column: 11]
        at
com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1419)
        at
com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:508)
        at
com.fasterxml.jackson.core.json.ReaderBasedJsonParser._reportInvalidToken(ReaderBasedJsonParser.java:2300)
        at
com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1459)
        at
com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:683)
        at
com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3105)
        at
com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3051)
        at
com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161)
        at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19)
        at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44)
        at org.apache.spark.sql.types.DataType$.fromJson(dataTypes.scala:41)
        at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$1$$anonfun$25.apply(newParquet.scala:675)
        at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$1$$anonfun$25.apply(newParquet.scala:675)



this is how I save parquet file with 1.1.1:

sql("select 1 as a").saveAsParquetFile("/tmp/foo")



and this is the meta data of the 1.1.1 parquet file:

creator:     parquet-mr version 1.4.3
extra:       org.apache.spark.sql.parquet.row.metadata =
StructType(List(StructField(a,IntegerType,false)))



by comparison, this is 1.3.0 meta:

creator:     parquet-mr version 1.6.0rc3
extra:       org.apache.spark.sql.parquet.row.metadata =
{"type":"struct","fields":[{"name":"a","type":"integer","nullable":t
[more]...



It looks like now ParquetRelation2 is used to load parquet file by default
and it only recognizes JSON format schema but 1.1.1 schema was case class
string format.

Setting spark.sql.parquet.useDataSourceApi to false will fix it, but I
don't know the differences.
Is this considered a bug? We have a lot of parquet files from 1.1.1, should
we disable data source api in order to read them if we want to upgrade to
1.3?

Thanks,
--
Pei-Lun

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message