spark-user mailing list archives

From Sean Owen <>
Subject Re: Java RDD Union
Date Fri, 05 Dec 2014 20:58:37 GMT
foreach does not modify an existing RDD either. It is an action that runs
a function on each element for its side effects and returns nothing.
However, in practice, nothing stops you from fiddling with the Java
objects inside an RDD when you get a reference to them in a method
like this. This is definitely a bad idea, as there is certainly no
guarantee that any other operations will see any, some or all of these
changes.
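
To illustrate the point about in-place changes, here is a minimal sketch that uses only the JDK, not Spark. It assumes one reason such changes can go unseen: elements may be serialized and handed to another JVM, so the objects a function mutates are not necessarily the ones later operations read. The class MutationSketch, the element type Assoc, and the helpers shipToWorker and withHalvedDistances are hypothetical names made up for this example; the serialization round trip only mimics shipping data to a worker and is not a Spark API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class MutationSketch {

    // Hypothetical element type standing in for one (cluster id, content id,
    // distance) association; the mutable field is deliberate.
    public static final class Assoc implements Serializable {
        private static final long serialVersionUID = 1L;
        public final int clusterId;
        public final int contentId;
        public double distance;

        public Assoc(int clusterId, int contentId, double distance) {
            this.clusterId = clusterId;
            this.contentId = contentId;
            this.distance = distance;
        }
    }

    // Round-trips a value through Java serialization. This only mimics data
    // being shipped to another JVM; it is an assumption of this sketch, not
    // a call into Spark.
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T shipToWorker(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    // The safer pattern: build new elements instead of editing existing ones,
    // analogous to calling map() on an RDD and keeping the RDD it returns.
    public static List<Assoc> withHalvedDistances(List<Assoc> input) {
        return input.stream()
                .map(a -> new Assoc(a.clusterId, a.contentId, a.distance / 2.0))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ArrayList<Assoc> original = new ArrayList<>();
        original.add(new Assoc(1, 10, 4.0));

        // Mutating the "remote" copy inside forEach leaves the original untouched.
        ArrayList<Assoc> remote = shipToWorker(original);
        remote.forEach(a -> a.distance = 0.0);
        System.out.println("original after remote forEach: " + original.get(0).distance);

        // Transforming yields a new list; the input is not modified.
        List<Assoc> halved = withHalvedDistances(original);
        System.out.println("halved copy: " + halved.get(0).distance);
        System.out.println("original still: " + original.get(0).distance);
    }
}
```

In Spark terms, the equivalent of withHalvedDistances would be a transformation such as map() whose returned RDD you keep and use, rather than a foreach that edits elements in place.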

On Fri, Dec 5, 2014 at 2:40 PM, Ron Ayoub <> wrote:
> I tricked myself into thinking it was uniting things correctly. I see I'm
> wrong now.
> I have a question regarding your comment that RDDs are immutable. Can you
> change values in an RDD using forEach? Does that violate immutability? I've
> been using forEach to modify an RDD, but perhaps I've tricked myself once
> again into believing it is working. I have object references, so perhaps it
> is working serendipitously in local mode, since the references are in fact
> not changing but their referents are, and somehow this will no longer work
> when clustering.
> Thanks for the comments.
>> Date: Fri, 5 Dec 2014 14:22:38 -0600
>> Subject: Re: Java RDD Union
>> No, RDDs are immutable. union() creates a new RDD, and does not modify
>> an existing RDD. Maybe this obviates the question. I'm not sure what
>> you mean about releasing from memory. If you want to repartition the
>> unioned RDD, you repartition the result of union(), not anything else.
>> On Fri, Dec 5, 2014 at 1:27 PM, Ron Ayoub <> wrote:
>> > I'm a bit confused regarding the expected behavior of unions. I'm running
>> > on 8 cores. I have an RDD that is used to collect cluster associations
>> > (cluster id, content id, distance) for internal clusters as well as leaf
>> > clusters, since I'm doing hierarchical k-means and need all distances for
>> > sorting documents appropriately upon examination.
>> >
>> > It appears that Union simply adds the items in the argument to the RDD
>> > instance the method is called on, rather than just returning a new RDD.
>> > If I want to do Union this way, as more of an add/append, should I be
>> > capturing the return value and releasing it from memory? I need help
>> > clarifying the semantics here.
>> >
>> > Also, in another related thread someone mentioned coalesce after union.
>> > Would I need to do the same on the instance RDD I'm calling Union on?
>> >
>> > Perhaps a method such as append would be useful and clearer.