spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-6830) Memoize frequently queried vals in RDD, such as numPartitions, count etc.
Date Mon, 29 Jun 2015 19:00:04 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606157#comment-14606157 ]

Sean Owen commented on SPARK-6830:
----------------------------------

Is this valid? For example, consider an RDD built from a file that is still being written to:
count() would return a larger value each time it is called. Memoizing the count would change
this behavior. Of course, caching the RDD would also mean the count was then fixed, but the two
are semantically different.
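
A minimal sketch of the distinction above, in plain Python rather than Spark. The names
GrowingSource and MemoizedCount are hypothetical stand-ins for a file-backed RDD and a
memoized count(); neither is part of Spark's API, and the sketch only illustrates how a
remembered count can go stale once the underlying data grows.

```python
class GrowingSource:
    """Hypothetical stand-in for an RDD backed by a file that is still being written."""

    def __init__(self):
        self._lines = []

    def append(self, line):
        # Simulates another process appending to the underlying file.
        self._lines.append(line)

    def count(self):
        # Recomputed on every call, so it reflects the data as it is right now.
        return len(self._lines)


class MemoizedCount:
    """Hypothetical wrapper that remembers the first count() result."""

    def __init__(self, source):
        self._source = source
        self._cached = None

    def count(self):
        # Compute once, then keep returning the remembered value.
        if self._cached is None:
            self._cached = self._source.count()
        return self._cached


source = GrowingSource()
source.append("a")
memo = MemoizedCount(source)

first_plain, first_memo = source.count(), memo.count()
source.append("b")  # the underlying file grows
second_plain, second_memo = source.count(), memo.count()

print(first_plain, second_plain)  # 1 2
print(first_memo, second_memo)    # 1 1
```

The plain count follows the growing data, while the memoized count stays at its first value,
which is the behavior change the comment asks about.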

> Memoize frequently queried vals in RDD, such as numPartitions, count etc.
> -------------------------------------------------------------------------
>
>                 Key: SPARK-6830
>                 URL: https://issues.apache.org/jira/browse/SPARK-6830
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Shivaram Venkataraman
>            Priority: Minor
>              Labels: Starter
>
> We should memoize frequently queried vals in RDD, such as numPartitions, count etc.
> While using SparkR in RStudio, the `count` function seems to be called frequently by
> the IDE – I think this is to show some stats about variables in the workspace etc. but this
> is not great in SparkR as we trigger a job every time count is called.
> Memoization would help in this case, but we should also see if there is some better way
> to interact with RStudio.
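
The cost described in the issue can be sketched as follows, again in plain Python and under
stated assumptions: FakeRDD, jobs_run and memoized_count are hypothetical names used only for
illustration, and count() merely simulates launching a job. This is not how SparkR implements
count.

```python
class FakeRDD:
    """Hypothetical stand-in for a SparkR RDD; count() simulates launching a job."""

    def __init__(self, data):
        self._data = list(data)
        self.jobs_run = 0       # how many simulated jobs have been triggered
        self._count = None      # memoized result, filled in on first use

    def count(self):
        # Every call triggers a (simulated) job.
        self.jobs_run += 1
        return len(self._data)

    def memoized_count(self):
        # Only the first call triggers a job; later calls reuse the stored value.
        if self._count is None:
            self._count = self.count()
        return self._count


# An IDE polling the workspace five times without memoization.
plain = FakeRDD(range(100))
for _ in range(5):
    plain.count()

# The same polling pattern with memoization.
memoized = FakeRDD(range(100))
for _ in range(5):
    memoized.memoized_count()

print(plain.jobs_run)     # 5
print(memoized.jobs_run)  # 1
```

Under these assumptions, repeated polling costs one job instead of one job per call.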



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

