spark-issues mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Commented] (SPARK-13744) Dataframe RDD caching increases the input size for subsequent stages
Date Tue, 08 Mar 2016 20:48:40 GMT


Sean Owen commented on SPARK-13744:

It's reporting the number of bytes read, which does indeed depend on whether the data was read
from disk or from memory. In this case it is much smaller when read from disk; look at the size
of the Parquet file you generate. The stage detail page elaborates on this. As far as I can see
the figure is correct, and the details page explains it correctly too.
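
One way to check this is to compare the two sizes directly. The following is a minimal sketch, not part of the original comment. It assumes it is pasted into the same spark-shell session as the code quoted below (so `sc` is in scope), that "people.parquet" was written to the default Hadoop filesystem, and that the cached count has already run. The names `onDiskBytes` and `fs` are illustrative only.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Total bytes of the Parquet output directory on disk.
// Assumption: "people.parquet" resolves against the default filesystem of this session.
val fs = FileSystem.get(sc.hadoopConfiguration)
val onDiskBytes = fs.getContentSummary(new Path("people.parquet")).getLength
println(s"Parquet on disk: $onDiskBytes bytes")

// Bytes held by the block manager for each RDD that currently has cached partitions.
// getRDDStorageInfo is a developer API, so treat the exact fields as subject to change.
sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id}: cachedPartitions=${info.numCachedPartitions}, " +
    s"memSize=${info.memSize} bytes, diskSize=${info.diskSize} bytes")
}
```

If the explanation above holds, the on-disk figure should be in the range of the small input size and `memSize` in the range of the large one. A likely reason for the gap is that `cache()` on an RDD keeps deserialized row objects in memory by default, while the Parquet file stores the same repeated values encoded and compressed.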

> Dataframe RDD caching increases the input size for subsequent stages
> --------------------------------------------------------------------
>                 Key: SPARK-13744
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Web UI
>    Affects Versions: 1.6.0
>         Environment: OSX
>            Reporter: Justin Pihony
>            Priority: Minor
>         Attachments: Screen Shot 2016-03-08 at 10.35.51 AM.png
> Given the code below, the first run of count shows an input size of ~90KB, and
> the next run, after cache has been set, shows the same input size. However, every
> subsequent run shows an input size that is MUCH larger (500MB is listed
> as 38% for a default run). This size discrepancy seems to be a bug in the caching of a
> dataframe's RDD as far as I can see.
> {code}
> import sqlContext.implicits._
> case class Person(name:String ="Test", number:Double = 1000.2)
> val people = sc.parallelize(1 to 10000000,50).map { p => Person()}.toDF
> people.write.parquet("people.parquet")
> val parquetFile = sqlContext.read.parquet("people.parquet")
> parquetFile.rdd.count()
> parquetFile.rdd.cache()
> parquetFile.rdd.count()
> parquetFile.rdd.count()
> {code}

This message was sent by Atlassian JIRA
