flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Fabian Hueske <fhue...@gmail.com>
Subject Re: CSV sink partitioning and bucketing
Date Fri, 17 Feb 2017 09:12:55 GMT
Hi Flavio,

Flink does not come with an OutputFormat that creates buckets. It should
not be too hard to implement this in Flink though.

However, if you want a solution fast, I would try the following approach:
1) Search for a Hadoop OutputFormat that buckets Strings based on a key
(<Key, String>).
2) Implement a mapper that converts Row into a String and extracts the key
3) Use the Hadoop OutputFormat with Flink's HadoopOutputFormat wrapper.

Depending on the output format you might want to partition and sort the
data on the key before writing it out.

Best, Fabian

2017-02-17 9:32 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:

> Hi to all,
> in my use case I'd need to output my Row objects into an output folder as
> CSV on HDFS but creating/overwriting new subfolders based on an attribute
> (for example create a subfolder for each value of a specified column).
> Then, it could be interesting to bucketing the data inside those folders by
> number of lines,i.e. every file inside those directory cannot contain more
> than 1000 lines.
>
> For example, if I have a dataset (of Row) containing people I need to
> write my dataset as CSV into an output folder X  partitioned by year (where
> each file cannot have more then 1000 rows), like:
>
> X/1990/file1
>    /1990/file2
>    /1991/file1
> etc..
>
> Does something like that exists in Flink?
> In principle I could use Hive for this but at the moment I'd try to avoid
> to add another component to our pipeline...Moreover, my feeling is that
> very few people is using Flink on Hive..am I wrong?
> Any advice on how to proceed?
>
> Best,
> Flavio
>
>

Mime
View raw message