spark-issues mailing list archives

From "Stuart Reynolds (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-21392) Unable to infer schema when loading Parquet file
Date Wed, 12 Jul 2017 20:05:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-21392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stuart Reynolds updated SPARK-21392:
------------------------------------
    Description: 
The following boring code works

{code:python}
    response = "mi_or_chd_5"

    outcome = sqlc.sql("""select eid,{response} as response
    from outcomes
    where {response} IS NOT NULL""".format(response=response))
    outcome.write.parquet(response, mode="overwrite")
    
    >>> print outcome.schema
    StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
{code}
    
But then,
{code:python}
    outcome2 = sqlc.read.parquet(response)  # fail
{code}

fails with:

{code:python}
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}

in

{code:python}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

Seems related to SPARK-16975 (https://issues.apache.org/jira/browse/SPARK-16975), which claims the issue was fixed in 2.0.1 and 2.1.0. (This bug is against 2.1.1.)


  was:
The following boring code works

{code:python}
    response = "mi_or_chd_5"
    colname = "f123"

    outcome = sqlc.sql("""select eid,{response} as response
    from outcomes
    where {response} IS NOT NULL""".format(response=response))
    outcome.write.parquet(response, mode="overwrite")
    
    col = sqlc.sql("""select eid,{colname} as {colname}
    from baseline_denull
    where {colname} IS NOT NULL""".format(colname=colname))
    col.write.parquet(colname, mode="overwrite")

    >>> print outcome.schema
    StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))

    >>> print col.schema
    StructType(List(StructField(eid,IntegerType,true),StructField(f123,DoubleType,true)))
{code}
    
But then,
{code:python}
    outcome2 = sqlc.read.parquet(response)  # fail
    col2 = sqlc.read.parquet(colname) # fail
{code}

fails with:

{code:python}
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
{code}

in

{code:python}
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
{code}

The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

Seems related to SPARK-16975 (https://issues.apache.org/jira/browse/SPARK-16975), which claims the issue was fixed in 2.0.1 and 2.1.0. (This bug is against 2.1.1.)



> Unable to infer schema when loading Parquet file
> ------------------------------------------------
>
>                 Key: SPARK-21392
>                 URL: https://issues.apache.org/jira/browse/SPARK-21392
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.1
>         Environment: Spark 2.1.1. python 2.7.6
>            Reporter: Stuart Reynolds
>              Labels: parquet, pyspark
>
> The following boring code works
> {code:python}
>     response = "mi_or_chd_5"
>     outcome = sqlc.sql("""select eid,{response} as response
>     from outcomes
>     where {response} IS NOT NULL""".format(response=response))
>     outcome.write.parquet(response, mode="overwrite")
>     
>     >>> print outcome.schema
>     StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
> {code}
>     
> But then,
> {code:python}
>     outcome2 = sqlc.read.parquet(response)  # fail
> {code}
> fails with:
> {code:python}
> AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
> {code}
> in 
> {code:python}
> /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
> {code}
> The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?
> Seems related to SPARK-16975 (https://issues.apache.org/jira/browse/SPARK-16975), which claims the issue was fixed in 2.0.1 and 2.1.0. (This bug is against 2.1.1.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

