beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eugene Kirpichov (JIRA)" <>
Subject [jira] [Assigned] (BEAM-1841) FileBasedSource should better handle when set of files grows while job is running
Date Thu, 30 Mar 2017 18:11:41 GMT


Eugene Kirpichov reassigned BEAM-1841:

    Assignee: Eugene Kirpichov  (was: Davor Bonaci)

> FileBasedSource should better handle when set of files grows while job is running
> ---------------------------------------------------------------------------------
>                 Key: BEAM-1841
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow, sdk-java-core
>            Reporter: Eugene Kirpichov
>            Assignee: Eugene Kirpichov
> In some cases people run pipelines over directories where the set of files in the directory
grows while the job runs. This may lead to a situation like this, in particular with Dataflow
> At job submission time, the FileBasedSource estimates the current size of the filepattern,
and ends up with a small number. Dataflow runner chooses thus a small desiredBundleSizeBytes
to pass to .splitIntoBundles(). However, at the time splitIntoBundles() runs, the set of files
has greatly grown, and we produce many more, unnecessarily small bundles, than anticipated.
> I see a few things we could do:
> - In splitIntoBundles(), compute the actual size and detect when the desired size is
unreasonably small for it; e.g. set an upper threshold on how many bundles we produce in total.
> - Somehow remember, at submission time, what was the estimated size. Then, in splitIntoBundles(),
compute the actual current size, and scale desiredBundleSizeBytes accordingly to get approximately
the intended number of bundles. Caveat: files may still change between the moment size is
estimated and the moment splitting happens.
> - (much larger in scope) Change the whole protocol to use number of bundles instead of
bundle size bytes. This probably won't happen with BoundedSource, but it is going to be the
case with Splittable DoFn.
> Option 1 seems by far the simplest.

This message was sent by Atlassian JIRA

View raw message