beam-user mailing list archives

From Vinay Patil <vinay18.pa...@gmail.com>
Subject Re: Rolling File Query
Date Mon, 26 Nov 2018 20:35:28 GMT
Hi Lukasz,

Thank you for your reply. I understood the part about buffering the elements,
but I did not clearly understand this: "a fixed number of keys where the
fixed number of keys controls the write parallelization". Let's say I have
only 4 keys; then 4 files will be created, but I need to create a new file
whenever the buffer size (10k) is reached, so more than 4 files could end up
being created, correct?

By "using the existing FileSystems API" you mean without using the existing
FileIO, right?

I was looking for an API similar to Flink's BucketingSink, which rolls to a
new file after the batch size is reached.
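As a plain-Python illustration of the rolling behavior I have in mind (this
is not Beam API; the class and file names are made up for the sketch):

```python
class RollingWriter:
    """Illustrative only: starts a new file every max_records records."""

    def __init__(self, max_records=10_000):
        self.max_records = max_records
        self.buffer = []
        self.file_index = 0
        self.files_written = []  # (filename, record_count) pairs

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_records:
            self.flush()

    def flush(self):
        # Roll: emit the current buffer as one file, then start a new one.
        if not self.buffer:
            return
        name = f"part-{self.file_index:05d}"
        self.files_written.append((name, len(self.buffer)))
        self.buffer = []
        self.file_index += 1


# Small demo with a batch size of 3 instead of 10k:
writer = RollingWriter(max_records=3)
for i in range(7):
    writer.write(i)
writer.flush()  # flush the final partial buffer
```

With 7 records and a batch size of 3, this produces two full files and one
partial one, which is the count-based rolling I'd like from Beam.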


Regards,
Vinay Patil


On Mon, Nov 26, 2018 at 1:04 PM Lukasz Cwik <lcwik@google.com> wrote:

> You could use a StatefulDoFn to buffer up 10k worth of data and write it.
> There is an example for batched RPC[1] that could be re-used using the
> FileSystems API to just create files. You'll want to use a reshuffle + a
> fixed number of keys where the fixed number of keys controls the write
> parallelization. Pipeline would look like:
>
> --> ParDo(assign deterministic random key in fixed key space) -->
> Reshuffle --> ParDo(10k buffering StatefulDoFn)
>
> You'll need to provide more details around any other constraints you have
> around writing 10k.
>
> 1: https://beam.apache.org/blog/2017/08/28/timely-processing.html
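
If I follow, the key-assignment step before the Reshuffle would look roughly
like this (a hypothetical sketch in plain Python, not actual pipeline code;
`NUM_KEYS` and `assign_key` are names I made up):

```python
import hashlib

NUM_KEYS = 4  # fixed key space; this number controls write parallelism


def assign_key(element):
    # Deterministic: the same element always maps to the same key,
    # so retries do not scatter an element across different keys.
    digest = hashlib.md5(str(element).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_KEYS, element


# Each element becomes a (key, element) pair; after a Reshuffle, at most
# NUM_KEYS bundles run the buffering StatefulDoFn in parallel.
keyed = [assign_key(e) for e in ["a", "b", "c", "a"]]
```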
>
>
> On Mon, Nov 26, 2018 at 9:39 AM Vinay Patil <vinay18.patil@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have a use case of writing only 10k in a single file after which a new
>> file should be created.
>>
>> Is there a way in Beam to roll a file after the specified number of
>> records ?
>>
>> Regards,
>> Vinay Patil
>>
>
