flink-user mailing list archives

From Robert Metzger <rmetz...@apache.org>
Subject Re: MultipleFileOutput based on field
Date Mon, 23 Feb 2015 20:41:40 GMT
Hi,

right now, there is no shiny API in Flink to do this directly, but you can
use Hadoop's MultipleTextOutputFormat with Flink's HadoopOutputFormat
wrapper:
https://github.com/rmetzger/scratch/blob/mulioutput-flink/src/main/java/com/github/Job.java

The example looks quite messy, but it worked well locally. It should also work on a cluster, although I haven't tested that.


There is also another way to solve this kind of issue: using a
"tagged" DataSet.
DataSet<String> start = ...;
DataSet<Tuple2<Integer, String>> tagged =
    start.map(s -> new Tuple2<>(<putOutputNumberHere>, s));
DataSet<Tuple2<Integer, String>> out1 = tagged.filter(t -> t.f0 == 0);
DataSet<Tuple2<Integer, String>> out2 = tagged.filter(t -> t.f0 == 1);
DataSet<Tuple2<Integer, String>> out3 = tagged.filter(t -> t.f0 == 2);

You can then write the DataSets out1 - out3 to separate files.
With this approach, you "simulate" directing the output of a single
transformation into different transformation chains / file outputs.
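To make the tag-and-filter idea above concrete without needing a Flink cluster, here is a minimal plain-Java sketch of the same pattern using only the standard library. The tagging rule (`tagOf`) and the class name are hypothetical placeholders, not Flink API; in Flink the grouping step would be the `map` that builds the `Tuple2`, and each resulting list would become its own `writeAsText(...)` sink.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TaggedRouting {

    // Hypothetical tagging rule: records starting with a digit get tag 0,
    // everything else gets tag 1. This plays the role of <putOutputNumberHere>.
    static int tagOf(String record) {
        return Character.isDigit(record.charAt(0)) ? 0 : 1;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("1-foo", "bar", "2-baz");

        // Step 1: tag every record (the Tuple2<Integer, String> step).
        // Step 2: split the tagged stream once per tag, as the filters do.
        Map<Integer, List<String>> byTag = records.stream()
                .collect(Collectors.groupingBy(TaggedRouting::tagOf));

        // In Flink, each of these lists would go to a separate file output.
        System.out.println(byTag.get(0)); // prints [1-foo, 2-baz]
        System.out.println(byTag.get(1)); // prints [bar]
    }
}
```

The design point is the same in both settings: tag once, then route by the tag, rather than trying to emit to multiple outputs from inside a single transformation.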

Best,
Robert


On Mon, Feb 23, 2015 at 4:43 PM, Yiannis Gkoufas <johngouf85@gmail.com>
wrote:

> Hi there,
>
> is it possible to write the results to HDFS in different files based on a
> field of a tuple?
> Something similar to this:
> http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
>
> Thanks a lot!
>
