spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] HyukjinKwon opened a new pull request #29098: [SPARK-32300][PYTHON][2.4] toPandas should work from a Spark DataFrame with no partitions
Date Tue, 14 Jul 2020 08:39:09 GMT

HyukjinKwon opened a new pull request #29098:
URL: https://github.com/apache/spark/pull/29098


   ### What changes were proposed in this pull request?
   
   This PR proposes to simply bypass the case where the computed array size is negative,
which happens when `toPandas` collects data from a Spark DataFrame with no partitions.
   
   ```python
   spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
   ```
   
   In master and branch-3.0, this was fixed as part of https://github.com/apache/spark/commit/ecaa495b1fe532c36e952ccac42f4715809476af,
but that commit was legitimately not backported to branch-2.4.
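
As a rough illustration of the bypass described above, here is a minimal Python sketch. It is not the actual patch, which lives on the JVM side. It assumes the array size is derived from the partition count minus one, so zero partitions would yield -1. The name `safe_buffer_length` is hypothetical and not part of Spark.

```python
def safe_buffer_length(num_partitions: int) -> int:
    """Hypothetical helper, not a Spark API.

    Assumption: the collection buffer is sized as num_partitions - 1,
    so a DataFrame with no partitions would request a size of -1.
    Clamping to zero sidesteps the negative size.
    """
    return max(0, num_partitions - 1)


# Zero partitions no longer maps to a negative size.
print(safe_buffer_length(0))
# Other partition counts are unaffected by the clamp.
print(safe_buffer_length(4))
```

On the JVM, allocating an array with a negative length raises `java.lang.NegativeArraySizeException`, which is why the size has to be guarded before allocation.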
   
   ### Why are the changes needed?
   
   To make it possible to convert an empty Spark DataFrame (one with no partitions) to a pandas DataFrame.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. The result of the following call changes:
   
   ```python
   spark.sparkContext.emptyRDD().toDF("col1 int").toPandas()
   ```
   
   **Before:**
   
   ```
   ...
   Caused by: java.lang.NegativeArraySizeException
   	at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3293)
   	at org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17.apply(Dataset.scala:3287)
   	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
   	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
   ...
   ```
   
   **After:**
   
   ```
   Empty DataFrame
   Columns: [col1]
   Index: []
   ```
   
   ### How was this patch tested?
   
   Manually tested, and unit tests were added.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org




