spark-issues mailing list archives

From "Dongjoon Hyun (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-16975) Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
Date Wed, 10 Aug 2016 20:12:20 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15415932#comment-15415932 ]

Dongjoon Hyun commented on SPARK-16975:
---------------------------------------

I made a sample case with similar behavior. I think it is closely related. [~rxin], what
do you think about this?

{code}
spark-1.6.2-bin-hadoop2.6$ ls /tmp/parquet16/
_SUCCESS         _locality_code=1 _locality_code=3 _locality_code=5 _locality_code=7 _locality_code=9
_locality_code=0 _locality_code=2 _locality_code=4 _locality_code=6 _locality_code=8
{code}

{code}
scala> spark.read.parquet("/tmp/parquet16").show
org.apache.spark.sql.AnalysisException: Unable to infer schema for ParquetFormat at /tmp/parquet16.
It must be specified manually;
scala> spark.read.parquet("/tmp/parquet16/_locality_code=0").show
+---+
| id|
+---+
|  0|
+---+
{code}
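
The comment does not show how /tmp/parquet16 was produced. A minimal sketch of one way such a layout could be written from a Spark 1.6.2 PySpark shell follows. The function name `write_sample` and its parameters are hypothetical, and the sketch assumes the sample used `id` values 0 through 9 copied into a `_locality_code` partition column; the exact directory listing may differ from the one above.

```python
def write_sample(sql_context, path="/tmp/parquet16", n=10):
    """Write n rows partitioned by a column whose name starts with '_'.

    sql_context is assumed to be the `sqlContext` object provided by the
    Spark 1.6.2 PySpark shell. This is an illustration, not the command
    actually used in the comment.
    """
    # One row per id: 0, 1, ..., n-1.
    df = sql_context.range(n)
    # Copy id into _locality_code and partition on it, so each value gets
    # its own _locality_code=<value> directory and the data files keep
    # only the id column.
    (df.withColumn("_locality_code", df["id"])
       .write
       .partitionBy("_locality_code")
       .parquet(path))
```

In the 1.6.2 shell this would be called as `write_sample(sqlContext)`, after which the two reads above can be retried from a 2.0.0 shell.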

> Spark-2.0.0 unable to infer schema for parquet data written by Spark-1.6.2
> --------------------------------------------------------------------------
>
>                 Key: SPARK-16975
>                 URL: https://issues.apache.org/jira/browse/SPARK-16975
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.0
>         Environment: Ubuntu Linux 14.04
>            Reporter: immerrr again
>              Labels: parquet
>
> Spark-2.0.0 seems to have some problems reading a parquet dataset generated by 1.6.2.

> {code}
> In [80]: spark.read.parquet('/path/to/data')
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data. It must
be specified manually;'
> {code}
> The dataset is ~150G and is partitioned by the _locality_code column. None of the partitions are empty. I have narrowed the failing dataset to the first 32 partitions of the data:
> {code}
> In [82]: spark.read.parquet(*subdirs[:32])
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AI.
It must be specified manually;'
> {code}
> Interestingly, it works OK if you remove any of the partitions from the list:
> {code}
> In [83]: for i in range(32): spark.read.parquet(*(subdirs[:i] + subdirs[i+1:32]))
> {code}
> Another strange thing is that the schemas for the first and the last 31 partitions of
the subset are identical:
> {code}
> In [84]: spark.read.parquet(*subdirs[:31]).schema.fields == spark.read.parquet(*subdirs[1:32]).schema.fields
> Out[84]: True
> {code}
> This got me interested, so I tried this:
> {code}
> In [87]: spark.read.parquet(*([subdirs[0]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AQ,/path/to/data/_locality_code=AQ.
It must be specified manually;'
> In [88]: spark.read.parquet(*([subdirs[15]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=AX,/path/to/data/_locality_code=AX.
It must be specified manually;'
> In [89]: spark.read.parquet(*([subdirs[31]] * 32))
> ...
> AnalysisException: u'Unable to infer schema for ParquetFormat at /path/to/data/_locality_code=BE,/path/to/data/_locality_code=BE.
It must be specified manually;'
> {code}
> If I read the first partition, save it in 2.0 and try to read in the same manner, everything
is fine:
> {code}
> In [100]: spark.read.parquet(subdirs[0]).write.parquet('spark-2.0-test')
> 16/08/09 11:03:37 WARN ParquetRecordReader: Can not initialize counter due to context
is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> In [101]: df = spark.read.parquet(*(['spark-2.0-test'] * 32))
> {code}
> I originally posted this to the user mailing list, but given these latest findings it clearly seems like a bug.
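
The list expressions in the quoted experiments can be checked without Spark. The sketch below uses a hypothetical stand-in for `subdirs` (the report shows only a few of the real partition values) and counts how many paths each call passes. The helper names `leave_one_out` and `repeated` are illustrative only. It shows that every failing call on the 1.6.2 data passes 32 paths, while the calls reported as working pass 31.

```python
# Hypothetical stand-in for the reporter's `subdirs` list; the real values
# (AQ, AI, ..., BE) are only partly shown in the report.
subdirs = ["/path/to/data/_locality_code=%02d" % i for i in range(40)]


def leave_one_out(paths, i, n=32):
    """Paths passed in In [83]: the first n entries with entry i removed."""
    return paths[:i] + paths[i + 1:n]


def repeated(paths, i, n=32):
    """Paths passed in In [87] to In [89]: entry i repeated n times."""
    return [paths[i]] * n


# Number of paths handed to spark.read.parquet in each experiment.
counts = [
    ("In [82] first 32 (fails)", len(subdirs[:32])),
    ("In [83] leave one out (works)", len(leave_one_out(subdirs, 0))),
    ("In [84] first 31 and last 31 (work)", len(subdirs[:31])),
    ("In [87] same directory repeated (fails)", len(repeated(subdirs, 0))),
]
for name, n in counts:
    print(name, n)
```

The count alone does not explain the failure: In [101] also passes 32 paths and succeeds, the difference being that those files were written by 2.0.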



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

