flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michele Bertoni <michele1.bert...@mail.polimi.it>
Subject Re: intermediate result reuse
Date Sat, 12 Sep 2015 16:44:03 GMT
ok, I think I got the point: I don’t have two execute but a collect in some branch
I will look for a way to remove it


What I am doing is to keep all the elements of A that as value equal to something in B, where
B (at this point) is very small
Is it better to collect or a cogroup?


btw is something you expect to solve i further versions?


thanks
michele




Il giorno 12/set/2015, alle ore 16:27, Stephan Ewen <sewen@apache.org<mailto:sewen@apache.org>>
ha scritto:

Fabian has explained it well. All functions are executed lazily as one DAG, when "env.execute()"
is called.

Beware that there are three exceptions:
  - count()
  - collect()
  - print()

These functions trigger an immediate program execution (they are "eager" functions). They
will execute all that is needed for produce their result. Summing up:

---------------------------

One execution in this case (result "a" is reused by "b" and "c")

a = env.createInput() -> map() -> reduce() -> filter()

b = a.flatmap()
c = a.groupBy() -> reduce()

b.writeAsText()
c.writeAsCsv()

env.execute();

---------------------------

Two executions in this case ("a" is computed twice, once for "b" and once for "c")

a = env.createInput() -> map() -> reduce() -> filter()

b = a -> flatmap() -> count()
c = a -> groupBy() -> reduce().collect()

---------------------------

Greetings,
Stephan


On Sat, Sep 12, 2015 at 11:31 AM, Fabian Hueske <fhueske@gmail.com<mailto:fhueske@gmail.com>>
wrote:
Hi Michele,

Flink programs can have multiple sinks.
In your program, the intermediate result a will be streamed to both filters (b and c) at the
same time and both sinks will be written at the same time.
So in this case, there is no need to materialize the intermediate result a.

If you call execute() after you defined b, the program will compute a and stream the result
only to b.
If you call execute() again after you defined c, the program will compute a again and stream
the result to c.

Summary:
Flink programs can usually stream intermediate results without materializing them. There are
a few cases where it needs to materialize intermediate results in order to avoid deadlocks,
but these are fully transparently handled.
It is not possible (yet!) to share results across program executions, i.e., whenever you call
execute().

I suppose, you call execute() between defining b and c. If you execute that call, a will be
computed once and both b and c are computed at the same time.

Best, Fabian

2015-09-12 11:02 GMT+02:00 Michele Bertoni <michele1.bertoni@mail.polimi.it<mailto:michele1.bertoni@mail.polimi.it>>:
Hi everybody, I have a question about internal optimization
is flink able to reuse intermediate result that are used twice in the graph?

i.e.
a = readsource -> filter -> reduce -> something else even more complicated

b = a filter(something)
store b

c = a filter(something else)
store c

what happens to a? is it computed twice?

in my read function I have a some logging commands and I see the printed twice, but it sounds
strange to me



thanks
cheers
michele



Mime
View raw message