spark-issues mailing list archives

From "Hossein Falaki (JIRA)" <>
Subject [jira] [Commented] (SPARK-17790) Support for parallelizing data.frame larger than 2GB
Date Wed, 05 Oct 2016 21:28:20 GMT


Hossein Falaki commented on SPARK-17790:

Thanks for pointing it out. SPARK-6235 seems to be an umbrella ticket. This one can be a subtask of it.

> Support for parallelizing data.frame larger than 2GB
> ----------------------------------------------------
>                 Key: SPARK-17790
>                 URL:
>             Project: Spark
>          Issue Type: Story
>          Components: SparkR
>    Affects Versions: 2.0.1
>            Reporter: Hossein Falaki
> This issue is a more specific version of SPARK-17762. 
> Supporting larger than 2GB arguments is more general and arguably harder to do because
the limit exists both in R and JVM (because we receive data as a ByteArray). However, to support
parallelizing R data.frames that are larger than 2GB, we can do what PySpark does.
> PySpark uses files to transfer bulk data between Python and JVM. It has worked well for
the large community of Spark Python users. 
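A minimal sketch of the file-based transfer idea (this is an illustration, not the actual PySpark or SparkR implementation): instead of shipping one giant byte array over a socket, the driver serializes the data to a local temp file in length-prefixed frames, and the JVM side reads the frames back. File offsets are 64-bit, so the signed 32-bit array limit (the 2GB ceiling) no longer applies to the total payload, only to individual frames. The helper names (`write_chunks`, `read_chunks`) are hypothetical.

```python
# Hypothetical sketch of file-based bulk transfer: frame the serialized
# records with a 4-byte big-endian length prefix, terminated by -1.
# Each frame stays well under 2GB even though the file as a whole can
# be arbitrarily large.
import os
import struct
import tempfile

def write_chunks(records, path, chunk_size=1 << 20):
    """Write length-prefixed frames of serialized records to `path`."""
    with open(path, "wb") as f:
        buf = bytearray()
        for rec in records:
            buf += rec
            if len(buf) >= chunk_size:
                f.write(struct.pack(">i", len(buf)))  # frame length header
                f.write(buf)
                buf.clear()
        if buf:
            f.write(struct.pack(">i", len(buf)))
            f.write(buf)
        f.write(struct.pack(">i", -1))  # end-of-stream marker

def read_chunks(path):
    """Read the frames back; mimics what a JVM-side reader would do."""
    with open(path, "rb") as f:
        while True:
            (length,) = struct.unpack(">i", f.read(4))
            if length < 0:
                break
            yield f.read(length)

# Round-trip a few serialized "records" through a temp file.
records = [b"row-%d" % i for i in range(3)]
fd, path = tempfile.mkstemp()
os.close(fd)
write_chunks(records, path)
chunks = list(read_chunks(path))
os.remove(path)
```

The big-endian (`>i`) framing matches what `java.io.DataInputStream.readInt` expects, so the JVM side can consume the file with plain stream reads rather than materializing a single byte array.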

This message was sent by Atlassian JIRA

