Hi Fabian,

I was trying to use the strategy you suggested with Flink 0.8.1, but it seems that the union of the datasets cannot be created programmatically, because the union operator gives the generated dataset the name of the calling function, so that only the first dataset is read. My code looks like:

private static DataSet<Tuple6<...>> getSourceDs(ExecutionEnvironment env, final String outputGraph, List<String> tableNames) {
    DataSet<Tuple6<...>> ret = null;
    for (String tableName : tableNames) {
        DataSet<Tuple6<...>> sourceDs = env.createInput(new MyTableInputFormat(tableName))
                ....
        if (ret == null)
            ret = sourceDs;
        else
            ret.union(sourceDs);
    }
    return ret;
}

Is this a bug or am I doing something wrong?

Thanks in advance,

Flavio
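
One thing worth checking in the loop above: `union` returns a new DataSet and leaves the receiver untouched, so a bare `ret.union(sourceDs);` discards its result and `ret` keeps pointing at the first source. The following is a minimal plain-Java sketch of that behaviour, not Flink code. `Bag` is a hypothetical immutable stand-in for a DataSet, and `unionAll` is an illustrative helper name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical immutable stand-in for a DataSet; not part of the Flink API. */
final class Bag<T> {
    private final List<T> items;

    Bag(List<T> items) {
        this.items = Collections.unmodifiableList(new ArrayList<>(items));
    }

    /** Returns a NEW bag holding both inputs; the receiver is left unchanged. */
    Bag<T> union(Bag<T> other) {
        List<T> merged = new ArrayList<>(items);
        merged.addAll(other.items);
        return new Bag<>(merged);
    }

    List<T> items() {
        return items;
    }

    /** Folds all sources into one bag, reassigning the accumulator on every step. */
    static <T> Bag<T> unionAll(List<Bag<T>> sources) {
        Bag<T> ret = null;
        for (Bag<T> source : sources) {
            // Writing only "ret.union(source);" here would throw the result away.
            ret = (ret == null) ? source : ret.union(source);
        }
        return ret;
    }
}
```

Under that assumption, the corresponding change in the loop would be `ret = ret.union(sourceDs);`.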

On Mon, Dec 22, 2014 at 2:42 PM, <fhueske@gmail.com> wrote:

Union is just combining data from multiple sources into a single dataset.

That’s it. No memory, no disk involved.

In your case you have

input1.union(input2).groupBy(1).reduce(…)

This will translate into:

input1 -> repartition ->
                          read-both-inputs -> sort -> reduce
input2 -> repartition ->

So, in your case not even additional network transfer is involved, because both data sets would need to be partitioned for the reduce anyway.

Note, union in Flink has SQL union-all semantics, i.e., there is no removal of duplicates.

Cheers, Fabian
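
To make the union-all point concrete, here is a small plain-Java sketch of the union, group-by, reduce shape described above. It is only an in-memory model, not Flink code and not how Flink executes the plan. The class `UnionAllSketch`, its method names, and the choice of summing as the reduce function are all assumptions made for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of union -> groupBy -> reduce; not Flink code. */
final class UnionAllSketch {
    /** Convenience factory for a (key, value) record. */
    static Map.Entry<String, Integer> rec(String key, int value) {
        return new AbstractMap.SimpleEntry<String, Integer>(key, value);
    }

    /** Concatenates both inputs without removing duplicates (UNION ALL semantics). */
    static List<Map.Entry<String, Integer>> unionAll(
            List<Map.Entry<String, Integer>> input1,
            List<Map.Entry<String, Integer>> input2) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>(input1);
        out.addAll(input2);
        return out;
    }

    /** Groups records by key and reduces each group by summing its values. */
    static Map<String, Integer> groupAndSum(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records) {
            sums.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return sums;
    }
}
```

Because identical records from the two inputs are both kept, a record that appears in both contributes twice to the reduce. If that is not wanted, a separate de-duplication step would be needed before grouping.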

From: Flavio Pompermaier
Sent: Monday, 22. December, 2014 14:32
To: user@flink.incubator.apache.org

Ok, thanks Fabian. I'd just like to know the internals of the union of multiple datasets (partitioning, distribution among servers, memory/disk, etc.). Do you have any reference for this?

Thanks in advance,
Flavio

On Mon, Dec 22, 2014 at 12:46 PM, Fabian Hueske <fhueske@apache.org> wrote:
Follow the first approach.
Joins are expensive, union comes for free.

Best, Fabian

2014-12-22 11:47 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:
Hi guys,

In my use case I have multiple DataSets with the same structure (e.g. Tuple3) and I want to produce an output DataSet containing all Tuple3 grouped by the first field (0).
I can obtain the same result by performing a union of all datasets followed by a group by (the simplest implementation), or by joining all of them pairwise (((A->B)->C)->D)...), or perhaps by some other solution I don't know of. When should I use the first or the second approach? Could you help me understand the internals of the two approaches? I am always a bit wary of using multiple joins when I don't know the exact sizes of the inputs.

Best,
Flavio
