spark-user mailing list archives

From "Evo Eftimov" <evo.efti...@isecc.com>
Subject RE: Map one RDD into two RDD
Date Thu, 07 May 2015 20:46:27 GMT
1. Will rdd2.filter run before rdd1.filter finishes?

YES

2. We have to traverse the RDD twice. Any comments?

You can invoke filter, or any other transformation or function, as many times as you need.



PS: You have to study the Parallel Programming Model of an OO Framework like Spark.
In any OO Framework, much of the behavior is hidden (encapsulated) by the framework, and the
client code gets invoked at specific points in the flow of control and data through callback
functions.

 

That’s why code like RDD.filter() followed by RDD.filter() may look “sequential” to you, but it
is not.
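
To make the point about the two filter calls concrete, here is a minimal sketch that is not taken from the thread. It assumes Spark's Scala API is on the classpath and uses a local master; the object name `LazyFilters`, the `split` helper and the even/odd predicates are hypothetical. Each `filter` call only records a transformation and returns immediately. The data is traversed when an action such as `count()` runs, and caching the source is one way to avoid recomputing it for the second traversal.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LazyFilters {
  // Hypothetical helper: derives two RDDs from one source with two filters.
  // No data is read here; filter only extends the lineage graph.
  def split(source: RDD[Int]): (RDD[Int], RDD[Int]) = {
    source.cache() // mark the source for reuse across the two traversals
    val evens = source.filter(_ % 2 == 0)
    val odds  = source.filter(_ % 2 != 0)
    (evens, odds)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two worker threads, for illustration only.
    val conf = new SparkConf().setAppName("lazy-filters").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (evens, odds) = split(sc.parallelize(1 to 10))
      // The actions below trigger the actual work, one job per action.
      println(s"evens=${evens.count()} odds=${odds.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

When the actions are called from a single driver thread, as here, the two `count()` jobs are submitted one after the other; within each job the partitions are processed in parallel.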

 

 

From: Bill Q [mailto:bill.q.hdp@gmail.com] 
Sent: Thursday, May 7, 2015 6:27 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: Map one RDD into two RDD

 

The multi-threading code in Scala is quite simple and you can google it pretty easily. We
used the Future framework. You can use Akka also.
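
The Future-based approach mentioned above might look roughly like the sketch below. It is not Bill's actual code: the object name `ConcurrentActions`, the `countBoth` helper, the even/odd predicates and the ten-minute timeout are all assumptions made for illustration. The idea is that each action is submitted from its own thread, so Spark's scheduler can run the two jobs at the same time.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ConcurrentActions {
  // Hypothetical helper: runs two actions on the same source concurrently
  // and returns (number of evens, number of odds).
  def countBoth(source: RDD[Int]): (Long, Long) = {
    source.cache() // partitions are reusable once they have been materialized
    // Each Future body runs on a separate thread and submits its own Spark job.
    val evensF = Future { source.filter(_ % 2 == 0).count() }
    val oddsF  = Future { source.filter(_ % 2 != 0).count() }
    // Block the caller until both jobs have finished (timeout is an assumption).
    Await.result(evensF.zip(oddsF), 10.minutes)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("concurrent-actions").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      val (evens, odds) = countBoth(sc.parallelize(1 to 1000))
      println(s"evens=$evens odds=$odds")
    } finally {
      sc.stop()
    }
  }
}
```

Note that two jobs starting at the same moment may each compute the source partitions before the cache is populated, so caching helps most when the source has already been materialized by an earlier action.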

 

@Evo My concerns about the filtering solution are: 1. Will rdd2.filter run before rdd1.filter
finishes? 2. We have to traverse the RDD twice. Any comments?



On Thursday, May 7, 2015, Evo Eftimov <evo.eftimov@isecc.com> wrote:

Scala is a language; Spark is an OO/Functional distributed framework that facilitates Parallel
Programming in a distributed environment.

 

Any “Scala parallelism” occurs within the Parallel Model imposed by the Spark OO Framework,
i.e. it is limited in how far it can influence the behavior of the Spark Framework.
That is the nature of programming with or for frameworks.

 

When RDD1 and RDD2 are partitioned and different Actions are applied to them, this results
in Parallel Pipelines / DAGs within the Spark Framework:

RDD1 = RDD.filter()

RDD2 = RDD.filter()

 

 

From: Bill Q [mailto:bill.q.hdp@gmail.com]
Sent: Thursday, May 7, 2015 4:55 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: Map one RDD into two RDD

 

Thanks for the replies. We decided to use concurrency in Scala to do the two mappings using
the same source RDD in parallel. So far, it seems to be working. Any comments?

On Wednesday, May 6, 2015, Evo Eftimov <evo.eftimov@isecc.com> wrote:

RDD1 = RDD.filter()

RDD2 = RDD.filter()
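
Applied to the original question (one map that yields two types of data), the two-filter suggestion could be sketched as follows. This is an illustration under stated assumptions, not the poster's code: the `SplitByType` object, the `classify` function and the choice of `Either[String, Int]` as the tag are hypothetical. The map runs once over the input, its result is cached, and each output RDD then selects one of the two types.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SplitByType {
  // Hypothetical mapping function: numeric records become Right(Int),
  // everything else is kept as Left(String).
  def classify(line: String): Either[String, Int] =
    try Right(line.trim.toInt)
    catch { case _: NumberFormatException => Left(line) }

  // Maps the input once, then derives one RDD per record type.
  def split(input: RDD[String]): (RDD[String], RDD[Int]) = {
    val tagged = input.map(classify).cache() // reuse the mapped data for both outputs
    // RDD.collect with a partial function filters and maps in one step (still lazy).
    val texts   = tagged.collect { case Left(s) => s }
    val numbers = tagged.collect { case Right(n) => n }
    (texts, numbers)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("split-by-type").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (texts, numbers) = split(sc.parallelize(Seq("1", "a", "2", "b")))
      println(texts.collect().toList)
      println(numbers.collect().toList)
    } finally {
      sc.stop()
    }
  }
}
```

Tagging each record lets the potentially expensive mapping function run once per record, while the two cheap selections read the cached result.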

 

From: Bill Q [mailto:bill.q.hdp@gmail.com] 
Sent: Tuesday, May 5, 2015 10:42 PM
To: user@spark.apache.org
Subject: Map one RDD into two RDD

 

Hi all,

I have a large RDD that I map a function over. Based on the nature of each record in the
input RDD, I generate two types of data. I would like to save each type into its own
RDD, but I can't seem to find an efficient way to do it. Any suggestions?

 

Many thanks.

 

 

Bill




 

