beam-commits mailing list archives

From "Eugene Kirpichov (JIRA)" <>
Subject [jira] [Closed] (BEAM-3499) Watch can make no progress if a single poll takes more than checkpoint interval
Date Tue, 06 Feb 2018 19:48:00 GMT


Eugene Kirpichov closed BEAM-3499.
       Resolution: Fixed
    Fix Version/s: 2.3.0

> Watch can make no progress if a single poll takes more than checkpoint interval
> -------------------------------------------------------------------------------
>                 Key: BEAM-3499
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov
>            Assignee: Eugene Kirpichov
>            Priority: Major
>             Fix For: 2.3.0
>          Time Spent: 20m
>  Remaining Estimate: 0h
> E.g. when Watch is used to poll a filepattern matching hundreds of thousands of files, a
single poll may take >10 seconds (the default checkpoint interval in
OutputAndTimeBoundedSplittableProcessElementInvoker). Because of that, the tracker
(GrowthTracker) gets checkpointed before anything is added to it, i.e. before [,] at
a moment when it doesn't contain any useful information, so the residual checkpoint state
is as empty as the initial one. When we resume from the residual checkpoint, the situation
simply repeats, until we get lucky enough to either take <10s to poll, or to not be asked
to checkpoint for >10s (e.g. because the checkpointing thread isn't scheduled).
> One possible fix is to change the SDF checkpointing strategy to have a progress
guarantee: e.g., start counting time from the moment the first block is claimed, or allow
the tracker to refuse checkpointing if nothing is claimed yet, or something similar.
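
To make the failure mode and the proposed guarantee concrete, here is a minimal
stand-alone sketch. It does not use Beam; the class and method names are hypothetical,
and the model is deliberately simplified: one poll per round, and a checkpoint that fires
a fixed time after the round starts unless the clock is only started at the first claim.

```java
class WatchCheckpointSketch {
    /**
     * Returns the 1-based round in which a poll result is first claimed,
     * or -1 if no simulated round makes progress.
     *
     * pollMillisPerRound:     how long the poll takes in each successive round
     * checkpointMillis:       checkpoint interval (10 seconds in the report above)
     * startClockAtFirstClaim: the proposed guarantee; if true, the checkpoint
     *                         timer only starts once something has been claimed
     */
    static int firstRoundWithProgress(
            long[] pollMillisPerRound, long checkpointMillis, boolean startClockAtFirstClaim) {
        for (int round = 0; round < pollMillisPerRound.length; round++) {
            // Without the guarantee, a poll slower than the interval is cut off
            // before it claims anything, so the residual state is still empty
            // and the next round starts from scratch.
            boolean checkpointedBeforeClaim =
                    !startClockAtFirstClaim && pollMillisPerRound[round] > checkpointMillis;
            if (!checkpointedBeforeClaim) {
                return round + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] polls = {15_000, 12_000, 8_000};
        // No guarantee: rounds 1 and 2 are wasted, round 3 happens to be fast enough.
        System.out.println(firstRoundWithProgress(polls, 10_000, false));
        // With the guarantee: the very first poll is allowed to finish and claim.
        System.out.println(firstRoundWithProgress(polls, 10_000, true));
    }
}
```

In this model, a workload whose polls are always slower than the interval never
progresses without the guarantee, which matches the behaviour described above.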
> A workaround for users of Watch (primarily via FileIO.match().continuously()) is to shard
their filepattern into a set of finer-granularity filepatterns matching fewer files, so that
each match call takes less than 10 seconds.
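
One way the sharding workaround could look is sketched below. The helper is hypothetical
and assumes that file names are spread roughly evenly over a known set of leading
characters (hex digits in the usage note); the bucket path is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class FilepatternSharding {
    /**
     * Splits one broad glob such as "dir/*.json" into one narrower glob per
     * leading file-name character, e.g. "dir/0*.json", "dir/1*.json", ...
     * Files whose names start with a character outside the alphabet are not
     * matched by any shard, so the alphabet must cover all real names.
     */
    static List<String> shardByLeadingChar(String directory, String suffixGlob, String alphabet) {
        List<String> shards = new ArrayList<>();
        for (char c : alphabet.toCharArray()) {
            shards.add(directory + "/" + c + suffixGlob);
        }
        return shards;
    }

    public static void main(String[] args) {
        // Hypothetical bucket and naming scheme.
        for (String p : shardByLeadingChar("gs://my-bucket/logs", "*.json", "0123456789abcdef")) {
            System.out.println(p);
        }
    }
}
```

Each resulting pattern would then be watched separately, for example by feeding the list
into FileIO.matchAll().continuously(...) instead of passing one broad pattern to
FileIO.match(). Whether each shard actually polls in under 10 seconds depends on how
the files are distributed.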

This message was sent by Atlassian JIRA
