flink-dev mailing list archives

From Ufuk Celebi <...@apache.org>
Subject Re: [VOTE] Release Apache Flink 1.1.4 (RC3)
Date Fri, 16 Dec 2016 10:05:51 GMT
If the memory consumption behaviour changed since 1.1.3, I think this
is a blocker for the release. I would not like users to upgrade their
Flink installation and all of a sudden run into killed containers. And
I would not blindly change the default memory fraction without
understanding the root cause.
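
(For context, the "memory fraction" discussed here maps to a few config keys. A minimal flink-conf.yaml sketch, assuming a 1.1.x YARN setup; the key names and defaults below are from memory and should be checked against the 1.1 documentation:)

```yaml
# Fraction of free memory the TaskManager reserves as managed memory
# (sorting, hashing, caching). Believed to default to 0.7 in 1.1.x.
taskmanager.memory.fraction: 0.7

# Fraction of the requested YARN container size subtracted from the JVM
# heap, leaving headroom for off-heap allocations (e.g. rocksdb native
# memory). Raising it reduces "beyond physical memory limits" kills at
# the cost of heap size. Believed to default to 0.25.
yarn.heap-cutoff-ratio: 0.25

# Minimum absolute cutoff in MB, applied regardless of the ratio.
yarn.heap-cutoff-min: 600
```

Raising the cutoff would only paper over the symptom, which is exactly the concern about blindly changing defaults.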

The other issues you found are nice to fix, but they can also go out with 1.1.5 imo.

I checked the following so far:
- Files are cleaned up on checkpoint failures (FS backend)
- Streams are eagerly closed on cancellation (FS backend)
- Non-recoverable jobs in ZooKeeper are skipped and a warning is given
- StreamingStateMachine job with HA (killing TMs and JM)
- Manually compared stream consumption behaviour with 1.1.3 and RC2
(comparing the "fairness")

The last fix in particular is very important and should get out as
soon as possible.
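
(On the constant-retry-loop symptom reported below: a bounded restart strategy at least makes such a job fail fast instead of looping indefinitely. A flink-conf.yaml sketch; the key names are from the 1.1-era restart strategy config and should be double-checked against the docs:)

```yaml
# Restart the job at most 3 times, 10 s apart, then fail it permanently
# instead of retrying forever.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```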

On Fri, Dec 16, 2016 at 10:43 AM, Robert Metzger <rmetzger@apache.org> wrote:
> I'm not sure we can release this candidate as is, because I'm running
> into two issues that are probably related to the recent rocksdb version
> upgrade.
> This is my list of points so far:
> - Checked the staging repository. Quickstarts and Hadoop 1 / 2 are okay.
> - Built a job against the staging repository
> - Binaries deploy on a kerberized HA YARN / HDFS setup. Ran the KMeans and
> WordCount batch jobs
> - Executed a heavy, misbehaved streaming job for a few hours. While running
> that job, I found that:
>   - Not all checkpoint directories are cleaned up in HDFS (I use the async
> rocksdb statebackend)
>   - segfaults from rocksdb (8 segfaults in ~3 hrs, but they all
> happened in the last minutes)
>   - "beyond physical memory limits" container killings from YARN (I know
> we can configure this, I just wonder if we should change the default
> value)
>   - the segfaults and memory limits eventually left the job unable to
> run because it was stuck in a constant retry loop.
>   - A non-blocking issue I found during testing:
> https://issues.apache.org/jira/browse/FLINK-5345
>   - Another non-blocking issue for 1.1.4 (fixed for 1.2):
> https://issues.apache.org/jira/browse/FLINK-4631
> Let me know if we should release anyway or fix these issues first.
> On Tue, Dec 13, 2016 at 11:04 PM, Ufuk Celebi <uce@apache.org> wrote:
>> Dear Flink community,
>> Please vote on releasing the following candidate as Apache Flink version
>> 1.1.4.
>> The commit to be voted on:
>> 2cd6579 (http://git-wip-us.apache.org/repos/asf/flink/commit/2cd6579)
>> Branch:
>> release-1.1.4-rc3
>> (https://git1-us-west.apache.org/repos/asf/flink/repo?p=flink.git;a=shortlog;h=refs/heads/release-1.1.4-rc3)
>> The release artifacts to be voted on can be found at:
>> http://people.apache.org/~uce/flink-1.1.4-rc3/
>> The release artifacts are signed with the key with fingerprint 9D403309:
>> http://www.apache.org/dist/flink/KEYS
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapacheflink-1109
>> -------------------------------------------------------------
>> The voting time is at least three days and the vote passes if a
>> majority of at least three +1 PMC votes are cast. The vote ends earliest
>> on Friday, December 16th, 2016, at 11 PM (CET)/2 PM (PST).
>> [ ] +1 Release this package as Apache Flink 1.1.4
>> [ ] -1 Do not release this package, because ...
