flink-user mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject CSV sink partitioning and bucketing
Date Fri, 17 Feb 2017 08:32:34 GMT
Hi all,
in my use case I need to write my Row objects to an output folder on HDFS as
CSV, creating or overwriting subfolders based on an attribute
(for example, one subfolder for each value of a specified column).
It would also be useful to bucket the data inside those folders by
number of lines, i.e. no file inside those directories may contain more
than 1000 lines.

For example, if I have a dataset (of Row) containing people, I need to write
it as CSV into an output folder X partitioned by year (where each
file cannot have more than 1000 rows), like:

X/1990/file1
X/1990/file2
X/1991/file1
etc.
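
The layout above can be made concrete with a minimal plain-Java sketch of the file assignment. It deliberately uses no Flink API: the class name PartitionedBucketPaths, the method assignPaths, and the "fileN" naming are hypothetical, chosen only to mirror the example, and rows are assumed to arrive as a simple list of partition keys (here, the year).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration of the desired output layout only; not a Flink sink. */
final class PartitionedBucketPaths {

    /**
     * For each partition key (in input order), returns the relative path of the
     * file the corresponding row would be written to. A new file is started for
     * a key once the current one already holds maxLinesPerFile rows.
     */
    static List<String> assignPaths(String baseDir, List<String> partitionKeys, int maxLinesPerFile) {
        if (maxLinesPerFile <= 0) {
            throw new IllegalArgumentException("maxLinesPerFile must be positive");
        }
        // Number of rows seen so far per partition key.
        Map<String, Integer> rowsSeen = new HashMap<>();
        List<String> paths = new ArrayList<>(partitionKeys.size());
        for (String key : partitionKeys) {
            // Count this row, then derive the 1-based file index within the partition.
            int seen = rowsSeen.merge(key, 1, Integer::sum);
            int fileIndex = (seen - 1) / maxLinesPerFile + 1;
            paths.add(baseDir + "/" + key + "/file" + fileIndex);
        }
        return paths;
    }

    public static void main(String[] args) {
        // A limit of 2 lines per file keeps the example small; the real limit would be 1000.
        List<String> years = Arrays.asList("1990", "1990", "1991", "1990");
        for (String path : assignPaths("X", years, 2)) {
            System.out.println(path);
        }
    }
}
```

With a limit of 2 lines per file, the four rows above go to X/1990/file1, X/1990/file1, X/1991/file1 and X/1990/file2 respectively. A real sink would also have to handle parallel writers and overwriting existing subfolders, which this sketch ignores.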

Does something like that exist in Flink?
In principle I could use Hive for this, but at the moment I'd like to avoid
adding another component to our pipeline. Moreover, my feeling is that
very few people are using Flink on Hive. Am I wrong?
Any advice on how to proceed?

Best,
Flavio
