spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: Output Side Effects for different chain of operations
Date Thu, 15 Dec 2016 19:42:15 GMT
You can just write the files out directly (and idempotently) in your
map/mapPartitions functions. A map function is just a function, after all,
so you can run arbitrary code in it.
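A minimal sketch of that idempotent-write pattern, in plain Python. The file I/O here stands in for the body of a `mapPartitionsWithIndex` function; the deterministic file name plus atomic rename is what makes a retried task attempt safe. The directory layout and helper name are illustrative, not from the thread:

```python
import os
import tempfile

def write_partition_idempotently(partition_id, rows, out_dir):
    """Write one partition's rows to a deterministic file name.

    Idempotent: a retried task writes to a fresh temp file, then
    atomically renames it over the same final path, so duplicate
    attempts leave exactly one consistent copy.
    """
    final_path = os.path.join(out_dir, "part-%05d.txt" % partition_id)
    fd, tmp_path = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "w") as f:
        for row in rows:
            f.write("%s\n" % row)
    os.replace(tmp_path, final_path)  # atomic rename on POSIX
    return [final_path]  # mapPartitions-style functions return an iterable

# Simulated driver loop; in Spark this would be roughly:
#   rdd.mapPartitionsWithIndex(
#       lambda i, it: write_partition_idempotently(i, it, out_dir))
out_dir = tempfile.mkdtemp()
paths = []
for pid, part in enumerate([[1, 2], [3, 4]]):
    paths += write_partition_idempotently(pid, part, out_dir)
print(paths)
```

Step E can then pick up the written paths (e.g. from the returned file names) instead of copying the temp files to a shared location first.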


On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit <sumitkchawla@gmail.com>
wrote:

> Any suggestions on this one?
>
> Regards
> Sumit Chawla
>
>
> On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit <sumitkchawla@gmail.com>
> wrote:
>
>> Hi All
>>
>> I have a workflow with different steps in my program. Let's say these are
>> steps A, B, C, D.  Step B produces some temp files on each executor node.
>> How can I add another step E which consumes these files?
>>
>> I understand the easiest choice is to copy all these temp files to a
>> shared location, and then step E can create another RDD from them and work
>> on that.  But I am trying to avoid this copy.  I was wondering if there is
>> any way I can queue up these files for E as they are being generated on the
>> executors.  Is there any possibility of creating a dummy RDD at the start
>> of the program, and then pushing these files into this RDD from each executor?
>>
>> I take my inspiration from the concept of Side Outputs in Google Dataflow:
>>
>> https://cloud.google.com/dataflow/model/par-do#emitting-to-side-outputs-in-your-dofn
>>
>>
>>
>> Regards
>> Sumit Chawla
>>
>>
>
