flink-dev mailing list archives

From Maximilian Michels <...@apache.org>
Subject Re: Should collect() and count() be treated as data sinks?
Date Tue, 07 Apr 2015 16:21:59 GMT
On Mon, Apr 6, 2015 at 2:37 PM, Stephan Ewen <sewen@apache.org> wrote:

> BTW: Should "print()" also be an "eager" statement? I think it needs to be,
> if we want to print to the driver's std out.


Yes, if we change print() to print on the Client, then it needs to execute
eagerly.
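
To make the eager/lazy distinction concrete, here is a minimal sketch in plain Java. It is a toy model, not Flink code: ToyEnv, addPrintSink(), count() and execute() are hypothetical names chosen for illustration. Lazy sinks are only recorded until execute() runs them, while the eager call returns its result immediately and registers nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: ToyEnv and its methods are hypothetical, not Flink classes.
class ToyEnv {
    // Lazy sinks are only recorded here; nothing runs until execute().
    private final List<Runnable> pendingSinks = new ArrayList<>();

    // Lazy: register a sink that prints the data when the plan is executed.
    void addPrintSink(List<Long> data) {
        pendingSinks.add(() -> System.out.println(data));
    }

    // Eager: computes and returns immediately, registers no sink.
    long count(List<Long> data) {
        return data.size();
    }

    // Runs all registered sinks and returns how many were run.
    // Fails when nothing was registered, like the error quoted in this thread.
    int execute() {
        if (pendingSinks.isEmpty()) {
            throw new RuntimeException("No data sinks have been created yet.");
        }
        int ran = pendingSinks.size();
        pendingSinks.forEach(Runnable::run);
        pendingSinks.clear();
        return ran;
    }

    public static void main(String[] args) {
        ToyEnv env = new ToyEnv();
        List<Long> data = Arrays.asList(1L);
        // Eager call: the driver gets the value right away.
        System.out.println(env.count(data));
        try {
            // No lazy sink was registered, so there is nothing to execute.
            env.execute();
        } catch (RuntimeException e) {
            System.out.println("execute() failed: " + e.getMessage());
        }
    }
}
```

In this model, a print() that writes to the client's standard out would have to behave like count(): it cannot be deferred to a later execute().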

On Thu, Apr 2, 2015 at 6:59 PM, Alexander Alexandrov <alexander.s.alexandrov@gmail.com> wrote:

> I would like to run a dataflow up to a particular point and materialize (in
> memory) the intermediate result. Is this possible at the moment?
>

Blocking execution mode is already implemented, but the results are
currently not cached, so they are lost once they have been consumed.
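
As a rough illustration of what "lost after they are consumed" means, here is a small plain-Java sketch. It is an analogy under my own assumptions, not Flink's runtime: ToyResult and consume() are hypothetical names. An uncached result behaves like an iterator that can be drained once, whereas a cached (materialized) result can hand out a fresh iterator to every consumer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy model only: these names are hypothetical and not part of Flink.
class ToyResult {
    // Drains the iterator into a list; afterwards the iterator is exhausted.
    static <T> List<T> consume(Iterator<T> result) {
        List<T> out = new ArrayList<>();
        while (result.hasNext()) {
            out.add(result.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // Uncached: the same stream of records is handed to two consumers.
        Iterator<Long> uncached = Arrays.asList(1L, 2L, 3L).iterator();
        System.out.println(consume(uncached)); // first consumer sees the data
        System.out.println(consume(uncached)); // second consumer sees nothing

        // Cached: materialize once, then give each consumer its own iterator.
        List<Long> cached = consume(Arrays.asList(1L, 2L, 3L).iterator());
        System.out.println(consume(cached.iterator()));
        System.out.println(consume(cached.iterator()));
    }
}
```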

On Mon, Apr 6, 2015 at 2:37 PM, Stephan Ewen <sewen@apache.org> wrote:

> count() and collect() need to immediately trigger an execution, because the
> driver program cannot proceed otherwise. They are "eager".
>
> Regular sinks are "lazy"; they wait until the program is triggered anyway.
>
> BTW: Should "print()" also be an "eager" statement? I think it needs to be,
> if we want to print to the driver's std out.
>
> On Thu, Apr 2, 2015 at 5:51 PM, Aljoscha Krettek <aljoscha@apache.org>
> wrote:
>
> > In my opinion it should not be handled like print. The idea behind
> > count()/collect() is that they immediately return the result which can
> > then be used in further flink operations.
> >
> > Right now, this is not properly/efficiently implemented but once we
> > have support for intermediate results on this level they start making
> > more sense. Also, in such a case an execute would not be required
> > after a collect()/count() if only the result of that call is required.
> >
> > On Thu, Apr 2, 2015 at 5:33 PM, Felix Neutatz <neutatz@googlemail.com>
> > wrote:
> > > Hi,
> > >
> > > I have run the following program:
> > >
> > > > final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
> > >
> > > > List<Tuple1<Long>> l = Arrays.asList(new Tuple1<Long>(1L));
> > > > TypeInformation<Tuple1<Long>> t = TypeInfoParser.parse("Tuple1<Long>");
> > > DataSet<Tuple1<Long>> data = env.fromCollection(l, t);
> > >
> > > long value = data.count();
> > > System.out.println(value);
> > >
> > > env.execute("example");
> > >
> > >
> > > Since there is no "real" data sink, I get the following:
> > > > Exception in thread "main" java.lang.RuntimeException: No data sinks have
> > > > been created yet. A program needs at least one sink that consumes data.
> > > > Examples are writing the data set or printing it.
> > >
> > > In my opinion, we should handle count() and collect() like print().
> > >
> > > What do you think?
> > >
> > > Best regards,
> > >
> > > Felix
> >
>
