spark-dev mailing list archives

From Anastasios Zouzias <zouz...@gmail.com>
Subject Re: Isolate 1 partition and perform computations
Date Mon, 16 Apr 2018 17:11:29 GMT
Hi all,

I think this is doable using the mapPartitionsWithIndex method of RDD.

Example:

val partitionIndex = 0 // Your favorite partition index here

val rdd = spark.sparkContext.parallelize(Array.range(0, 1000))

// Replace the elements of partition partitionIndex with -10, ..., -1
// (Array.range excludes its upper bound)

val fixed = rdd.mapPartitionsWithIndex { case (idx, iter) =>
  if (idx == partitionIndex) Array.range(-10, 0).iterator else iter
}
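
If the goal is to compute on one partition only and ignore the rest, mapPartitionsWithIndex can also return an empty iterator for every other index. A minimal sketch, assuming a local SparkSession and an RDD with 4 partitions; keepOnly is a hypothetical helper name, not part of Spark. Note that Spark still schedules one task per partition here; the other tasks simply produce nothing.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.reflect.ClassTag

object KeepOnePartitionSketch {
  // Hypothetical helper: pass partition `keep` through unchanged and
  // emit nothing for every other partition index.
  def keepOnly[T: ClassTag](rdd: RDD[T], keep: Int): RDD[T] =
    rdd.mapPartitionsWithIndex[T](
      (idx, iter) => if (idx == keep) iter else Iterator.empty,
      preservesPartitioning = true)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only for illustration.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("keep-one-partition-sketch")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(0 until 1000, numSlices = 4)
    val isolated = keepOnly(rdd, keep = 0)

    println(isolated.getNumPartitions) // still 4 partitions, 3 of them empty
    println(isolated.count())          // counts only the elements of partition 0

    spark.stop()
  }
}
```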

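The PartitionPruningRDD that Matthias suggested (quoted below) can be sketched as follows. PartitionPruningRDD.create takes the parent RDD and an Int => Boolean filter on partition indices, and the resulting RDD exposes only the partitions that pass the filter, so no tasks are launched for the others. The class is marked as a developer API in Spark, so treat this as a sketch rather than a recommendation; the local session, the 4-partition RDD, and the helper name prune are assumptions for illustration.

```scala
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}
import org.apache.spark.sql.SparkSession

object PrunePartitionSketch {
  // Hypothetical helper: keep only the partition whose index equals `keep`.
  def prune[T](rdd: RDD[T], keep: Int): RDD[T] =
    PartitionPruningRDD.create(rdd, (idx: Int) => idx == keep)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only for illustration.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("prune-partition-sketch")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(0 until 1000, numSlices = 4)
    val pruned = prune(rdd, keep = 0)

    println(pruned.getNumPartitions) // 1: only the selected partition remains
    println(pruned.count())          // counts only the elements of partition 0

    spark.stop()
  }
}
```

Because the pruned RDD is an ordinary RDD, any downstream transformation or action the user applies runs on that single partition only.
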

Best regards,
Anastasios


On Sun, Apr 15, 2018 at 12:59 AM, Thodoris Zois <zois@ics.forth.gr> wrote:

> I forgot to mention that I would like my approach to be independent of
> the application that the user is going to submit to Spark.
>
> Assume that I don’t know anything about the user’s application… I expected
> to find a simpler approach. I saw in RDD.scala that an RDD is characterized
> by a list of partitions. If I modify this list and keep only one partition,
> is it going to work?
>
> - Thodoris
>
>
> > On 15 Apr 2018, at 01:40, Matthias Boehm <mboehm7@gmail.com> wrote:
> >
> > you might wanna have a look into using a PartitionPruningRDD to select
> > a subset of partitions by ID. This approach worked very well for
> > multi-key lookups for us [1].
> >
> > A major advantage compared to scan-based operations is that, if your
> > source RDD has an existing partitioner, only relevant partitions are
> > accessed.
> >
> > [1] https://github.com/apache/systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/MatrixIndexingSPInstruction.java#L603
> >
> > Regards,
> > Matthias
> >
> > On Sat, Apr 14, 2018 at 3:12 PM, Thodoris Zois <zois@ics.forth.gr> wrote:
> >> Hello list,
> >>
> >> I am sorry for sending this message here, but I could not get any
> >> response on “users”. For specific purposes I would like to isolate 1
> >> partition of the RDD and perform computations only on it.
> >>
> >> For instance, suppose that a user asks Spark to create 500 partitions
> >> for the RDD. I would like Spark to create the partitions but perform
> >> computations on only one of those 500, ignoring the other 499.
> >>
> >> At first I tried to modify the executor in order to run only 1 partition
> >> (task), but I didn’t manage to make it work. Then I tried the DAG
> >> Scheduler, but I think that I should modify the code at a higher level:
> >> let Spark do the partitioning, but at the end see only one partition and
> >> throw away all the others.
> >>
> >> My question is: which file should I modify in order to isolate 1
> >> partition of the RDD? Where is the actual partitioning done?
> >>
> >> I hope it is clear!
> >>
> >> Thank you very much,
> >> Thodoris
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >>


-- 
-- Anastasios Zouzias
<azo@zurich.ibm.com>
