flink-dev mailing list archives

From Aljoscha Krettek <aljos...@apache.org>
Subject Re: [DISCUSS] Release testing procedures, Flink 1.3.2
Date Wed, 19 Jul 2017 15:26:59 GMT
Hi,

Yes! In my opinion, the most critical issues are these:

 - https://issues.apache.org/jira/browse/FLINK-6964: Fix recovery for incremental checkpoints in StandaloneCompletedCheckpointStore
 - https://issues.apache.org/jira/browse/FLINK-7041: Deserialize StateBackend from JobCheckpointingSettings with user classloader

The first one makes incremental checkpoints on RocksDB unusable with externalised checkpoints.
The second means that you cannot have a custom configuration of the RocksDB backend.
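The classloader issue (FLINK-7041) boils down to deserialization resolving classes through the wrong lookup scope: the state backend was being deserialized without the user classloader, which is the only one that can see classes shipped in the user jar. As a rough analogy only (Flink's actual fix is in Java, and every name below is hypothetical), Python's pickle supports the same kind of scoped class resolution by overriding find_class:

```python
import io
import pickle

class CustomBackend:
    """Stand-in for a user-provided state backend class that lives in
    the user's jar, not on the framework's default classpath."""
    def __init__(self, path):
        self.path = path

class ScopedUnpickler(pickle.Unpickler):
    """Resolve classes through an explicit registry first -- analogous to
    deserializing with the user classloader instead of the system one."""
    def __init__(self, stream, registry):
        super().__init__(stream)
        self.registry = registry

    def find_class(self, module, name):
        # Consult the "user classloader" (the registry) before falling
        # back to the default global lookup.
        key = f"{module}.{name}"
        if key in self.registry:
            return self.registry[key]
        return super().find_class(module, name)

# Serialize a backend, then restore it through the scoped resolver.
blob = pickle.dumps(CustomBackend("/tmp/checkpoints"))
registry = {f"{CustomBackend.__module__}.CustomBackend": CustomBackend}
restored = ScopedUnpickler(io.BytesIO(blob), registry).load()
```

The point of the analogy: if the resolver (here the registry, in Flink the user classloader) is not consulted, restoring a custom backend class fails even though the bytes are perfectly fine.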

 - https://issues.apache.org/jira/browse/FLINK-7216: ExecutionGraph can perform concurrent global restarts to scheduling
 - https://issues.apache.org/jira/browse/FLINK-7153: Eager Scheduling can't allocate source for ExecutionGraph correctly

These are critical scheduler bugs; Stephan can probably say more about them than I can.

 - https://issues.apache.org/jira/browse/FLINK-7143: Partition assignment for Kafka consumer is not stable
 - https://issues.apache.org/jira/browse/FLINK-7195: FlinkKafkaConsumer should not respect fetched partitions to filter restored partition states
 - https://issues.apache.org/jira/browse/FLINK-6996: FlinkKafkaProducer010 doesn't guarantee at-least-once semantic

The first one means that you can get duplicate data because several consumers would be consuming
from the same partition without noticing it. The second one causes partitions to be dropped from
state if a broker is temporarily unreachable.
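To illustrate why unstable assignment leads to duplicates: if each consumer subtask derives its partition set from the order in which it happens to fetch partition metadata, two subtasks can end up claiming the same partition. The sketch below is a simplified illustration of the idea behind a deterministic assignment (not Flink's actual code): sorting first makes every subtask compute the same global order.

```python
def assign_partitions(partitions, num_subtasks, subtask_index):
    """Deterministically map (topic, partition-id) pairs to one subtask.

    Sorting makes the result independent of the order in which partition
    metadata was fetched from the brokers, so no two subtasks can ever
    claim the same partition.
    """
    ordered = sorted(partitions)
    return [p for i, p in enumerate(ordered) if i % num_subtasks == subtask_index]

# Two subtasks whose metadata arrived in different (shuffled) orders:
parts_a = [("topic", 3), ("topic", 1), ("topic", 0), ("topic", 2)]
parts_b = [("topic", 0), ("topic", 2), ("topic", 3), ("topic", 1)]

sub0 = assign_partitions(parts_a, 2, 0)  # -> [("topic", 0), ("topic", 2)]
sub1 = assign_partitions(parts_b, 2, 1)  # -> [("topic", 1), ("topic", 3)]
```

Together the two subtasks cover every partition exactly once, regardless of the order in which each one fetched the metadata.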

The first two issues would have been caught by my proposed testing procedures; the third and
fourth might be caught but are very tricky to provoke. I'm currently applying this testing
procedure to Flink 1.3.1 to see if I can provoke them.

The Kafka bugs are very hard to provoke because they only occur when Kafka has temporary
problems or there are communication problems.

I forgot to mention that I actually have two goals with this: 1) thoroughly test Flink and
2) build expertise in the community, i.e. we are forced to try cluster environments/distributions
that we are not familiar with, and we actually deploy a full job and play around with it.

Best,
Aljoscha


> On 19. Jul 2017, at 15:49, Shaoxuan Wang <shaoxuan@apache.org> wrote:
> 
> Hi Aljoscha,
> Glad to see that we have a more thorough testing procedure. Could you
> please share with us which critical issues were broken in 1.3.0 & 1.3.1,
> and how the newly proposed "functional testing section and a combination
> of systems/configurations" would cover them. This will help us improve
> our production verification as well.
> 
> Regards,
> Shaoxuan
> 
> 
> On Wed, Jul 19, 2017 at 9:11 PM, Aljoscha Krettek <aljoscha@apache.org>
> wrote:
> 
>> Hi Everyone,
>> 
>> We are on the verge of starting the release process for Flink 1.3.2 and
>> there have been some critical issues in both Flink 1.3.0 and 1.3.1. For
>> Flink 1.3.2 I want to make very sure that we test as much as possible. For
>> this I’m proposing a slightly changed testing procedure [1]. This is
>> similar to the testing document we used for previous releases but has a new
>> functional testing section that tries to outline a testing procedure and a
>> combination of systems/configurations that we have to test. Please have a
>> look and comment on whether you think this is sufficient (or a bit too
>> much).
>> 
>> What do you think?
>> 
>> Best,
>> Aljoscha
>> 
>> [1] https://docs.google.com/document/d/16fU1cpxoYf3o9cCDyakj7ZDnUoJTj4_CEmMTpCkY81s/edit?usp=sharing

