flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Robert Metzger (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-5874) Reject arrays as keys in DataStream API to avoid inconsistent hashing
Date Tue, 21 Feb 2017 17:36:44 GMT
Robert Metzger created FLINK-5874:
-------------------------------------

             Summary: Reject arrays as keys in DataStream API to avoid inconsistent hashing
                 Key: FLINK-5874
                 URL: https://issues.apache.org/jira/browse/FLINK-5874
             Project: Flink
          Issue Type: Bug
          Components: DataStream API
    Affects Versions: 1.1.4, 1.2.0
            Reporter: Robert Metzger
            Priority: Blocker


This issue has been reported on the mailing list twice:
- http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Previously-working-job-fails-on-Flink-1-2-0-td11741.html
- http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Arrays-values-in-keyBy-td7530.html

The problem is the following: We are using just Key[].hashCode() to compute the hash when
shuffling data. Java's default hashCode() implementation doesn't take the arrays contents
into account, but the memory address.
This leads to different hash code on the sender and receiver side.
In Flink 1.1 this means that the data is shuffled randomly and not keyed, and in Flink 1.2
the keygroups code detect a violation of the hashing.

The proper fix of the problem would be to rely on Flink's {{TypeComparator}} class, which
has a type-specific hashing function. But introducing this change would break compatibility
with existing code.
I'll file a JIRA for the 2.0 changes for that fix.

For 1.2.1 and 1.3.0 we should at least reject arrays as keys.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message