spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Ensuring eager evaluation inside mapPartitions
Date Fri, 16 Oct 2015 10:08:29 GMT
If you mean that getResult is called on the result of foo for each record, then
that already happens. If you mean that getResult is called only after foo has
been called on all records, then yes, you have to collect to a list.
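
A minimal sketch of the difference between the two cases, using plain Scala iterators rather than Spark so that it runs on its own. The `Response` class, the `log` buffer, and this `foo` are hypothetical stand-ins for illustration; they are not from the original question. The body of `eager` is what would go inside the function passed to mapPartitions.

```scala
import scala.collection.mutable.ArrayBuffer

object EagerVsLazy {
  // Hypothetical stand-in for whatever foo returns; records each getResult call.
  final case class Response(id: Int, log: ArrayBuffer[String]) {
    def getResult: Int = { log += s"getResult($id)"; id * 10 }
  }

  // Hypothetical stand-in for the slow call; it only records that it ran.
  def foo(x: Int, log: ArrayBuffer[String]): Response = {
    log += s"foo($x)"
    Response(x, log)
  }

  // Chained maps on an iterator: foo and getResult interleave record by record.
  def chained(it: Iterator[Int], log: ArrayBuffer[String]): Iterator[Int] =
    it.map(x => foo(x, log)).map(_.getResult)

  // toList forces foo over the whole partition before any getResult runs.
  def eager(it: Iterator[Int], log: ArrayBuffer[String]): Iterator[Int] =
    it.map(x => foo(x, log)).toList.iterator.map(_.getResult)
}
```

Note that the list holds every intermediate result of the partition in memory at once, so this only suits partitions that fit comfortably in the executor's heap.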

Why does it help with foo being slow in either case though?
You can try to consume the iterator in parallel with ".par" if that's what
you're getting at.
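
If the goal is to overlap the slow calls, one possible sketch uses Scala Futures instead of ".par" (on Scala 2.13 and later, ".par" needs the separate scala-parallel-collections module). This swaps in a different technique from the one named above. The `foo` here is a hypothetical blocking call, and the one-minute timeout is an arbitrary assumption.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ParallelFoo {
  // Hypothetical stand-in for a slow, blocking I/O call.
  def foo(x: Int): Int = { Thread.sleep(50); x * 10 }

  // Launch foo for every record of the partition, then wait for all results.
  def processPartition(it: Iterator[Int])(implicit ec: ExecutionContext): Iterator[Int] = {
    // toList forces the iterator, so every Future is started before we wait.
    val pending = it.map(x => Future(foo(x))).toList
    // Future.sequence keeps the input order; the timeout is an assumed value.
    Await.result(Future.sequence(pending), 1.minute).iterator
  }
}
```

How many calls actually run at once depends on the ExecutionContext supplied; the default global one is sized to the number of cores, so a dedicated pool may be a better fit for blocking I/O.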

On Fri, Oct 16, 2015 at 10:47 AM, alberskib <alberskib@gmail.com> wrote:

> Hi all,
>
> I am wondering whether there is a way to ensure that two consecutive maps
> inside mapPartitions will not be chained together.
>
> To illustrate my question, I prepared a short example:
>
> rdd.mapPartitions(it => {
>     it.map(x => foo(x)).map(y => y.getResult)
> })
>
> I would like to ensure that foo is applied to all records in the partition
> and that getResult is invoked on each record only after that. This could be
> beneficial when foo is a time-consuming I/O operation, e.g. a request to an
> external service for data (data that couldn't be prefetched).
>
> I know that converting the iterator into a list will do the job, but maybe
> there is a more clever way of doing it.
>
