beam-user mailing list archives

From Amit Sela <amitsel...@gmail.com>
Subject Re: collect to local
Date Mon, 20 Feb 2017 10:04:34 GMT
Spark runner's EvaluationContext
<https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/EvaluationContext.java#L201>
has a hook ready for this - but only for batch; in streaming this
feature doesn't seem relevant.

You can easily hack this into the Spark runner, but for Beam in general I
wonder how this would work in a runner-agnostic way?
Spark has a driver process; I'm not sure how this works for other runners.
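
One runner-agnostic workaround (a sketch, not a Beam API) is to end the
pipeline with TextIO.Write to a known prefix, wait for the pipeline to finish,
and then read the sharded output files (Beam's default shard naming looks like
`out-00000-of-00002`) back into a local List in the driver. The helper below,
`collectShards`, is hypothetical - it only shows the read-back half using the
plain JDK, assuming the runner has already written the shards to a local or
mounted filesystem:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CollectShards {

    // Read every line from the sharded output files matching the given
    // prefix (e.g. out-00000-of-00002, out-00001-of-00002) back into a
    // single local List<String>, in shard order.
    static List<String> collectShards(Path dir, String prefix) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                .filter(p -> p.getFileName().toString().startsWith(prefix))
                .sorted() // shard file names sort in shard order
                .flatMap(p -> {
                    try {
                        return Files.lines(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                })
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate two shards as a pipeline's TextIO.Write might produce them.
        Path dir = Files.createTempDirectory("beam-out");
        Files.write(dir.resolve("out-00000-of-00002"), List.of("a", "b"));
        Files.write(dir.resolve("out-00001-of-00002"), List.of("c"));
        System.out.println(collectShards(dir, "out-")); // prints [a, b, c]
    }
}
```

This sidesteps the driver-process question entirely: every runner can write a
sink, so "collect" becomes "write, wait, read back".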

On Mon, Feb 20, 2017 at 11:54 AM Antony Mayi <antonymayi@yahoo.com> wrote:

> Thanks Jean,
>
> my point is to retrieve the data represented, let's say, as a
> PCollection<String> into a List<String> (not PCollection<List<String>>) -
> essentially fetching it all into the address space of the local driver process
> (this is what Spark's .collect() does). It would be the reverse of
> beam.sdks.transforms.Create - which takes a local iterable and distributes
> it into a PCollection - at the end of the pipeline I would like to get the
> result back into a single iterable (hence I assume it would need some kind of
> Sink).
>
> Thanks,
> Antony.
>
>
> On Monday, 20 February 2017, 10:40, Jean-Baptiste Onofré <jb@nanthrax.net>
> wrote:
>
>
> Hi Antony,
>
> The Spark runner deals with caching/persist for you (analyzing how many
> times the same PCollection is used).
>
> For the collect(), I don't fully understand your question.
>
> If you want to process the elements of the PCollection, you can use a simple
> ParDo:
>
> .apply(ParDo.of(new DoFn<String, Void>() {
>   @ProcessElement
>   public void processElement(ProcessContext context) {
>     String element = context.element();
>     // do whatever you want
>   }
> }));
>
> Is that not what you want?
>
> Regards
> JB
>
> On 02/20/2017 10:30 AM, Antony Mayi wrote:
> > Hi,
> >
> > what is the best way to fetch content of PCollection to local
> > memory/process (something like calling .collect() on Spark rdd)? Do I
> > need to implement custom Sink?
> >
> > Thanks for any clues,
> > Antony.
>
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>
>
