beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Robert Bradshaw (JIRA)" <>
Subject [jira] [Commented] (BEAM-4858) Clean up _BatchSizeEstimator in element-batching transform.
Date Wed, 10 Oct 2018 13:53:00 GMT


Robert Bradshaw commented on BEAM-4858:

Submitted pr/6375.
 * Simplifies the aging out of old data. Not only was the old formula hard to
understand, but it meant that bad data could stick around and poison the
estimates forever.
 * Adds a variance parameter allowing the batch size to vary over a fixed
range giving a broader base for the linear regression.
 * Uses numpy when available to do the regression. This is both much more
efficient and allows for less error-prone expression of more complicated

The algorithm was also changed to:
 * Eliminates outliers, both using Cook's distance and just throwing out the
top (often high-variance and high-influence) 20% when there is sufficient
 * Weight by the inverse of batch size, to provide a more stable fixed size
estimate (which the default "overhead" target is sensitive to).

These changes were tested against a large TFT pipeline and found to produce
more uniform batch sizes and similar, possibly slightly improved, total
runtimes and total costs.

> Clean up _BatchSizeEstimator in element-batching transform.
> -----------------------------------------------------------
>                 Key: BEAM-4858
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Valentyn Tymofieiev
>            Assignee: Robert Bradshaw
>            Priority: Minor
>             Fix For: 2.8.0
>          Time Spent: 5h 40m
>  Remaining Estimate: 0h
> Beam Python 3 conversion [exposed|] non-trivial
performance-sensitive logic in element-batching transform. Let's take a look at [|].

> Due to Python 2 language semantics, the result of {{x2 / x1}} will depend on the type
of the keys - whether they are integers or floats. 
> The keys of key-value pairs contained in {{self._data}} are added as integers [here|],
however, when we 'thin' the collected entries [here|],
the keys will become floats. Surprisingly, using either integer or float division consistently
[in the comparator|]
 negatively affects the performance of a custom pipeline I was using to benchmark these changes.
The performance impact likely comes from changes in the logic that depends on  how division
is evaluated, not from the performance of division operation itself.
> In terms of Python 3 conversion the best course of action that avoids regression seems
to be to preserve the existing Python 2 behavior using {{old_div}} from {{past.utils.division}},
in the medium term we should clean up the logic. We may want to add a targeted microbenchmark
to evaluate performance of this code, and maybe cythonize the code, since it seems to be performance-sensitive.

This message was sent by Atlassian JIRA

View raw message