spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
Date Sat, 31 Oct 2015 21:28:27 GMT

    [ https://issues.apache.org/jira/browse/SPARK-11437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984166#comment-14984166 ]

Apache Spark commented on SPARK-11437:
--------------------------------------

User 'JasonMWhite' has created a pull request for this issue:
https://github.com/apache/spark/pull/9392

> createDataFrame shouldn't .take() when provided schema
> ------------------------------------------------------
>
>                 Key: SPARK-11437
>                 URL: https://issues.apache.org/jira/browse/SPARK-11437
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Jason White
>
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify that the first 10 rows of the RDD match the provided schema. This is similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue affected cases where a schema was not provided.
> Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily. If necessary, I believe this verification should be done lazily on all rows. However, since the caller is providing a schema to follow, I think it's acceptable to simply fail if the schema is incorrect.
> https://github.com/apache/spark/blob/master/python/pyspark/sql/context.py#L321-L325
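
The difference between checking the first rows eagerly and checking every row lazily can be sketched in plain Python, without Spark. This is only an illustration of the idea in the description, not the code in `context.py` or in the pull request: `SCHEMA`, `verify_row`, `eager_verify`, `lazy_verify`, and `source` are hypothetical names, and a generator stands in for an RDD whose rows are expensive to compute.

```python
from itertools import chain, islice

# Hypothetical schema: (field name, expected Python type). Not a PySpark StructType.
SCHEMA = (("name", str), ("age", int))

def verify_row(row, schema):
    """Return the row unchanged, or raise TypeError if it does not match the schema."""
    if len(row) != len(schema):
        raise TypeError(f"expected {len(schema)} fields, got {len(row)}: {row!r}")
    for value, (field, expected) in zip(row, schema):
        if not isinstance(value, expected):
            raise TypeError(f"field {field!r}: expected {expected.__name__}, got {value!r}")
    return row

def eager_verify(rows, schema, n=10):
    """Analogue of the `.take(10)` check: computes the first n rows immediately."""
    it = iter(rows)
    head = list(islice(it, n))  # forces upstream work at call time
    for row in head:
        verify_row(row, schema)
    return chain(head, it)      # rows after the first n are never checked

def lazy_verify(rows, schema):
    """Checks every row, but only when the result is actually consumed."""
    return (verify_row(row, schema) for row in rows)

computed = []  # records which rows the stand-in "RDD" has actually produced

def source():
    """Stand-in for an RDD; the third row violates the schema."""
    for i, row in enumerate([("ann", 31), ("bob", 45), ("cy", "n/a")]):
        computed.append(i)
        yield row
```

With `n=2`, `eager_verify(source(), SCHEMA, n=2)` produces two rows as soon as it is called and lets the malformed third row through. `lazy_verify(source(), SCHEMA)` produces nothing until it is iterated, and raises `TypeError` when the third row is reached.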



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

