flink-issues mailing list archives

From "Fabian Hueske (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-8601) Introduce LinkedBloomFilterState for Approximate calculation and other situations of performance optimization
Date Thu, 08 Feb 2018 08:59:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356673#comment-16356673 ]

Fabian Hueske commented on FLINK-8601:

It's good to know that you're using this in production.

The design document suggests that it is similar to MapState, ValueState, etc., because it uses the
same terms (xyzState, xyzStateDescriptor, getxyzState, etc.). If it does not behave like the
other state types, I'd suggest renaming these concepts. I'd also add an Implementation section
to the design document that explains which changes will be made.

Thanks, Fabian

> Introduce LinkedBloomFilterState for Approximate calculation and other situations of performance optimization
> -------------------------------------------------------------------------------------------------------------
>                 Key: FLINK-8601
>                 URL: https://issues.apache.org/jira/browse/FLINK-8601
>             Project: Flink
>          Issue Type: New Feature
>          Components: Core, DataStream API
>    Affects Versions: 1.4.0
>            Reporter: Sihua Zhou
>            Assignee: Sihua Zhou
>            Priority: Major
> h3. Background
> Bloom filters are useful in many situations, for example:
>  * 1. Approximate calculation: deduplication (e.g., UV calculation)
>  * 2. Performance optimization: e.g., [runtime filter join|https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]
> However, with the state types Flink currently provides, it is hard to use a Bloom filter, for the following reasons:
>  * 1. Serialization problem: Bloom filter state can be large (for example, 100M). If it is implemented on top of RocksDB state, the state data must be serialized each time it is queried or updated, and performance will be very poor.
>  * 2. Data skew: data in different key groups can be skewed, and the degree of skew cannot be accurately predicted before the program runs. It is therefore impossible to determine how many resources the Bloom filter should allocate. One option is to allocate the space needed for the most skewed case, but this can waste a great deal of resources.
> h3. Requirement
> Therefore, I propose introducing LinkedBloomFilterState for Flink, which needs to provide at least the following features:
>  * 1. Support for changing parallelism
>  * 2. Serialize only when necessary: when performing a checkpoint
>  * 3. Handle data skew: users only need to specify a LinkedBloomFilterState with the desired input and fpp, and the system allocates resources dynamically.
>  * 4. No conflict with other state: users can use KeyedState and OperatorState while using Bloom filter state.
>  * 5. Support relaxed TTL (i.e., the data survival time is at least greater than the specified
> Design doc:  [design doc|https://docs.google.com/document/d/1yMCT2ogE0CtSjzRvldgi0ZPPxC791PpkVGkVeqaUUI8/edit?usp=sharing]
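
The requirement that users specify only an expected input size and an fpp, with space allocated dynamically under skew, can be illustrated with a chain of Bloom filters that grows on demand. The following is a minimal sketch under stated assumptions, not the implementation from the design doc: the class name LinkedBloomFilterSketch, the FNV-1a hash, and the growth policy (double the capacity and halve the per-node fpp for each new node, an idea borrowed from scalable Bloom filters) are all hypothetical choices made for illustration. It does not cover checkpointing, rescaling, or TTL.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch of a chained ("linked") Bloom filter; not Flink API. */
final class LinkedBloomFilterSketch {

    /** One fixed-size Bloom filter in the chain. */
    private static final class Node {
        final BitSet bits;
        final int numBits;
        final int numHashes;
        final long capacity;
        long inserted;

        Node(long capacity, double fpp) {
            // Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
            double ln2 = Math.log(2);
            this.numBits = (int) Math.max(64, Math.ceil(-capacity * Math.log(fpp) / (ln2 * ln2)));
            this.numHashes = Math.max(1, (int) Math.round((double) numBits / capacity * ln2));
            this.bits = new BitSet(numBits);
            this.capacity = capacity;
        }

        /** Bit position of the i-th probe, derived from one 64-bit hash (double hashing). */
        int index(long hash, int i) {
            int h1 = (int) hash;
            int h2 = (int) (hash >>> 32) | 1; // odd step, so it is never zero
            return Math.floorMod(h1 + i * h2, numBits);
        }

        void add(long hash) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(hash, i));
            }
            inserted++;
        }

        boolean mightContain(long hash) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(hash, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    private final List<Node> nodes = new ArrayList<>();
    private final long initialCapacity;
    private final double fpp;

    LinkedBloomFilterSketch(long initialCapacity, double fpp) {
        if (initialCapacity <= 0 || fpp <= 0 || fpp >= 1) {
            throw new IllegalArgumentException("need initialCapacity > 0 and 0 < fpp < 1");
        }
        this.initialCapacity = initialCapacity;
        this.fpp = fpp;
        appendNode();
    }

    /** Assumed growth policy: double the capacity and halve the per-node fpp,
     *  so the per-node rates sum to less than the target fpp. */
    private void appendNode() {
        int n = nodes.size();
        nodes.add(new Node(initialCapacity << n, fpp * Math.pow(0.5, n + 1)));
    }

    /** Returns true if the key was not seen before (subject to false positives). */
    boolean add(String key) {
        long hash = fnv1a64(key.getBytes(StandardCharsets.UTF_8));
        if (mightContain(hash)) {
            return false;
        }
        Node tail = nodes.get(nodes.size() - 1);
        if (tail.inserted >= tail.capacity) { // tail is full: grow the chain
            appendNode();
            tail = nodes.get(nodes.size() - 1);
        }
        tail.add(hash);
        return true;
    }

    boolean mightContain(String key) {
        return mightContain(fnv1a64(key.getBytes(StandardCharsets.UTF_8)));
    }

    private boolean mightContain(long hash) {
        for (Node node : nodes) {
            if (node.mightContain(hash)) {
                return true;
            }
        }
        return false;
    }

    int nodeCount() {
        return nodes.size();
    }

    /** 64-bit FNV-1a, chosen only to keep the sketch dependency-free. */
    static long fnv1a64(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    public static void main(String[] args) {
        // Approximate UV count: 1000 events from 500 distinct users.
        LinkedBloomFilterSketch seen = new LinkedBloomFilterSketch(100, 0.01);
        long distinct = 0;
        for (int i = 0; i < 1000; i++) {
            if (seen.add("user-" + (i % 500))) {
                distinct++;
            }
        }
        System.out.println("approx. distinct = " + distinct + ", nodes = " + seen.nodeCount());
    }
}
```

In this sketch a lightly loaded key pays only for the first small node, while a heavily skewed key grows its chain as needed. A lookup has to probe every node in the chain, which is the price paid for not sizing each filter for the worst case up front.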

This message was sent by Atlassian JIRA
