spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Java RDD Union
Date Fri, 05 Dec 2014 20:58:37 GMT
foreach does not modify an existing RDD either. It is an action that
runs your function on each element for its side effects and returns
nothing. However, in practice, nothing stops you from mutating the Java
objects inside an RDD when you get a reference to them in a method
like this. This is definitely a bad idea, as there is no guarantee
that any other operation will see any, some, or all of these edits.
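
To illustrate why such edits can vanish, here is a minimal sketch in
plain Java with no Spark involved. It assumes only that an uncached RDD
is recomputed from its lineage each time an action runs, which a
Supplier stands in for here. The names RecomputeSketch, Assoc, and
source are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

final class RecomputeSketch {

    /** Mutable element type, standing in for the objects held by an RDD. */
    static final class Assoc {
        int clusterId;
        final double distance;

        Assoc(int clusterId, double distance) {
            this.clusterId = clusterId;
            this.distance = distance;
        }
    }

    /** Stand-in for an uncached RDD: every evaluation rebuilds the elements. */
    static Supplier<List<Assoc>> source() {
        return () -> {
            List<Assoc> out = new ArrayList<>();
            out.add(new Assoc(1, 0.5));
            out.add(new Assoc(2, 1.5));
            return out;
        };
    }

    /** Anti-pattern: mutate elements during one evaluation, then evaluate again. */
    static int sumAfterInPlaceEdit() {
        Supplier<List<Assoc>> rdd = source();
        // The edits land on objects that are discarded after this evaluation.
        rdd.get().forEach(a -> a.clusterId = 0);
        // A second evaluation rebuilds the elements, so the edits are not seen.
        return rdd.get().stream().mapToInt(a -> a.clusterId).sum();
    }

    /** Preferred shape: derive a new dataset and read from that one. */
    static int sumAfterMap() {
        Supplier<List<Assoc>> rdd = source();
        Supplier<List<Assoc>> mapped = () -> rdd.get().stream()
                .map(a -> new Assoc(0, a.distance))
                .collect(Collectors.toList());
        return mapped.get().stream().mapToInt(a -> a.clusterId).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumAfterInPlaceEdit()); // prints 3: edits were lost
        System.out.println(sumAfterMap());         // prints 0: new dataset has them
    }
}
```

In Spark itself the second shape corresponds to calling map() and using
the RDD it returns, rather than mutating elements inside foreach().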

On Fri, Dec 5, 2014 at 2:40 PM, Ron Ayoub <ronaldayoub@live.com> wrote:
> I tricked myself into thinking it was uniting things correctly. I see I'm
> wrong now.
>
> I have a question regarding your comment that RDDs are immutable. Can you
> change values in an RDD using foreach? Does that violate immutability? I've
> been using foreach to modify an RDD, but perhaps I've tricked myself once
> again into believing it is working. I have object references, so perhaps it
> is working serendipitously in local mode, since the references are in fact
> not changing but their referents are, and somehow this will no longer work
> when clustering.
>
> Thanks for the comments.
>
>> From: sowen@cloudera.com
>> Date: Fri, 5 Dec 2014 14:22:38 -0600
>> Subject: Re: Java RDD Union
>> To: ronaldayoub@live.com
>> CC: user@spark.apache.org
>
>>
>> No, RDDs are immutable. union() creates a new RDD, and does not modify
>> an existing RDD. Maybe this obviates the question. I'm not sure what
>> you mean about releasing from memory. If you want to repartition the
>> unioned RDD, you repartition the result of union(), not anything else.
>>
>> On Fri, Dec 5, 2014 at 1:27 PM, Ron Ayoub <ronaldayoub@live.com> wrote:
>> > I'm a bit confused regarding the expected behavior of unions. I'm running
>> > on 8 cores. I have an RDD that is used to collect cluster associations
>> > (cluster id, content id, distance) for internal clusters as well as leaf
>> > clusters, since I'm doing hierarchical k-means and need all distances for
>> > sorting documents appropriately upon examination.
>> >
>> > It appears that union simply adds the items in the argument to the RDD
>> > instance the method is called on rather than just returning a new RDD. If
>> > I want to do a union this way, as more of an add/append, should I be
>> > capturing the return value and releasing it from memory? I need help
>> > clarifying the semantics here.
>> >
>> > Also, in another related thread someone mentioned coalesce after union.
>> > Would I need to do the same on the instance RDD I'm calling union on?
>> >
>> > Perhaps a method such as append would be useful and clearer.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>


