spark-dev mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: RDD.count
Date Sat, 28 Mar 2015 04:04:40 GMT
I assume it is because map() could have side effects? Relying on them is
not generally a good idea, but the expectation, or contract, is that the
function passed to map() is still invoked. In this program the caller could
simply call count() on the parent RDD instead.
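
To illustrate the point, here is a minimal sketch, assuming Spark 2.x or
later (for longAccumulator) and a local master. The object and method names
are hypothetical. The function passed to map() bumps an accumulator, so
count() on the mapped RDD has to run it for every element, whereas count()
on the parent never touches it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; not part of Spark itself.
object CountSideEffects {

  // Returns (count of the mapped RDD, number of times the map function ran,
  // count of the parent RDD).
  def countWithSideEffects(sc: SparkContext, n: Int): (Long, Long, Long) = {
    val parent = sc.parallelize(1 to n, 4)

    // Accumulator used only to observe the side effect of the map function.
    val invoked = sc.longAccumulator("map invocations")

    // map() is lazy; nothing runs until an action such as count() is called.
    val mapped = parent.map { x =>
      invoked.add(1L) // side effect: record that the function was called
      x * 2
    }

    // count() on the mapped RDD evaluates the map function per element.
    val mappedCount = mapped.count()
    val invocations = invoked.value.longValue()

    // Counting the parent directly skips the map function entirely.
    val parentCount = parent.count()

    (mappedCount, invocations, parentCount)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("count-side-effects").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (mappedCount, invocations, parentCount) = countWithSideEffects(sc, 100)
      println(s"mapped count = $mappedCount, map invocations = $invocations, parent count = $parentCount")
    } finally {
      sc.stop()
    }
  }
}
```

In a local run with no task retries I would expect the invocation total to
match the element count, though accumulator updates made inside
transformations can be applied more than once if a task is re-executed.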
On Mar 28, 2015 1:00 AM, "jimfcarroll" <jimfcarroll@gmail.com> wrote:

> Hi all,
>
> I was wondering why the RDD.count call recomputes the RDD in all cases? In
> most cases it can simply ask the next dependent RDD. I have several RDD
> implementations and was surprised to see a call like the following never
> call my RDD's count method but instead recompute/traverse the entire
> dataset:
>
>    val myRDD: MyRDD = ...
>    myRDD.map({ ... }).count()
>
> Unless I'm mistaken, a MappedRDD never needs to do more than call 'count' on
> the underlying RDD. The underlying RDD's count method (in all of my cases)
> knows the count without a recompute (e.g. one of them selects the count
> from a DB). This is MUCH less expensive than recomputing the RDD.
>
> Thanks.
> Jim
