spark-issues mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Commented] (SPARK-11437) createDataFrame shouldn't .take() when provided schema
Date Sat, 31 Oct 2015 21:28:27 GMT


Apache Spark commented on SPARK-11437:

User 'JasonMWhite' has created a pull request for this issue:

> createDataFrame shouldn't .take() when provided schema
> ------------------------------------------------------
>                 Key: SPARK-11437
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Jason White
> When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)`
> to verify that the first 10 rows of the RDD match the provided schema. Similar to,
> but that issue affected cases where a schema was not provided.
>
> Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily.
> If necessary, I believe this verification should be done lazily on all rows. However, since
> the caller is providing a schema to follow, I think it's acceptable to simply fail if the
> schema is incorrect.
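
The difference between the two verification strategies described above can be sketched in plain Python. This is not Spark code and not the change in the pull request: `verify_row`, `eager_check`, and `lazy_check` are hypothetical names, a generator stands in for the RDD, and a tuple of Python types stands in for the schema.

```python
from itertools import islice


def verify_row(row, schema):
    """Return row unchanged, or raise TypeError if it does not match schema.

    schema is a tuple of Python types (a stand-in for a real Spark schema).
    """
    if len(row) != len(schema):
        raise TypeError("expected %d fields, got %d" % (len(schema), len(row)))
    for value, expected in zip(row, schema):
        if value is not None and not isinstance(value, expected):
            raise TypeError("%r is not of type %s" % (value, expected.__name__))
    return row


def eager_check(rows, schema, n=10):
    """Mimic the take(10) approach: pull the first n rows right now and verify them.

    Calling this forces the upstream computation to run, and rows after
    the first n are never checked.
    """
    head = list(islice(rows, n))
    for row in head:
        verify_row(row, schema)
    return head


def lazy_check(rows, schema):
    """Wrap rows in a generator that verifies each row as it is consumed.

    Nothing upstream runs until the caller iterates over the result, and
    every row is checked, not only the first few.
    """
    return (verify_row(row, schema) for row in rows)
```

With the eager variant, a bad row beyond the first ten goes unnoticed even though the computation has already been triggered. With the lazy variant, the same bad row raises an error only when it is actually reached.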

This message was sent by Atlassian JIRA
