beam-user mailing list archives

From Vinay Patil <>
Subject Re: Rolling File Query
Date Mon, 26 Nov 2018 20:35:28 GMT
Hi Lukasz,

Thank you for your reply. I understood the part about buffering the elements;
however, I did not clearly understand this: "a fixed number of keys where
the fixed number of keys controls the write parallelization". Let's say I
have only 4 keys: then 4 files will be created, but I need to create a new
file when the buffer size (10k) is reached, so it could happen that more
than 4 files are created, correct?
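To make the question above concrete, here is a plain-Python sketch (not Beam code; names like BufferingWriter and the small constants are hypothetical) of the keying-plus-buffering logic: elements are deterministically hashed into a fixed key space, and each key's buffer is flushed as a new "file" whenever it reaches the size limit, so one key can indeed produce several files.

```python
import hashlib

FIXED_KEY_COUNT = 4   # controls write parallelization
BUFFER_SIZE = 3       # use 10_000 in practice; kept small for illustration

def assign_key(element: str) -> int:
    """Deterministically map an element into the fixed key space."""
    digest = hashlib.md5(element.encode()).hexdigest()
    return int(digest, 16) % FIXED_KEY_COUNT

class BufferingWriter:
    """Per-key buffer that 'rolls' to a new file once BUFFER_SIZE is
    reached, mimicking what a stateful DoFn would do with a bag state
    plus a counter."""
    def __init__(self):
        self.buffers = {}        # key -> list of pending elements
        self.files_written = []  # (key, [elements]) tuples, in flush order

    def process(self, key, element):
        buf = self.buffers.setdefault(key, [])
        buf.append(element)
        if len(buf) >= BUFFER_SIZE:
            self.flush(key)

    def flush(self, key):
        buf = self.buffers.get(key)
        if buf:
            self.files_written.append((key, list(buf)))
            buf.clear()

    def finish(self):
        # Flush any partially filled buffers at the end of the bundle.
        for key in list(self.buffers):
            self.flush(key)

writer = BufferingWriter()
for record in (f"record-{i}" for i in range(10)):
    writer.process(assign_key(record), record)
writer.finish()

# Each "file" holds at most BUFFER_SIZE records, so the number of
# files is not capped at FIXED_KEY_COUNT: a busy key rolls over into
# multiple files.
print(len(writer.files_written))
```

In this sketch the parallelism is bounded by the key count (at most FIXED_KEY_COUNT buffers exist at once), while the file count grows with the data volume, which matches the behavior asked about above.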

By "using the existing FileSystems API", you mean without using the
existing FileIO, right?

I was looking for an API similar to Flink's BucketingSink, which rolls to a
new file after the batch size is reached.

Vinay Patil

On Mon, Nov 26, 2018 at 1:04 PM Lukasz Cwik <> wrote:

> You could use a StatefulDoFn to buffer up 10k worth of data and write it.
> There is an example for batched RPC[1] that could be re-used with the
> FileSystems API to just create files. You'll want to use a reshuffle plus
> a fixed number of keys, where the fixed number of keys controls the write
> parallelization. The pipeline would look like:
> --> ParDo(assign deterministic random key in fixed key space) -->
> Reshuffle --> ParDo(10k buffering StatefulDoFn)
> You'll need to provide more details around any other constraints you have
> around writing 10k records.
> 1:
> On Mon, Nov 26, 2018 at 9:39 AM Vinay Patil <>
> wrote:
>> Hi,
>> I have a use case of writing only 10k records to a single file, after
>> which a new file should be created.
>> Is there a way in Beam to roll a file after a specified number of
>> records?
>> Regards,
>> Vinay Patil
