hadoop-common-issues mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-7550) Need for Integrity Validation of RPC
Date Fri, 19 Aug 2011 00:48:29 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087431#comment-13087431 ]

Allen Wittenauer commented on HADOOP-7550:

From what I remember, krb5 vs krb5i was like 5-10% perf degradation. krb5p was like another
5%. I'd expect going from nothing to krb5i or krb5p to be fairly horrific.  On the plus side,
these are already implemented, known quantities, etc.  With hardware accelerated crypto now
common, the numbers are likely lower for anyone using anything relatively modern on non-Intel
gear.  For Intel gear, enabling AES support would probably help.

> Need for Integrity Validation of RPC
> ------------------------------------
>                 Key: HADOOP-7550
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7550
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>            Reporter: Dave Thompson
>            Assignee: Dave Thompson
> Some recent investigation of network packet corruption has shown a need for hadoop RPC
> integrity validation beyond assurances already provided by 802.3 link layer and TCP 16-bit
> CRC.
> During an unusual occurrence on a 4k node cluster, we've seen as high as 4 TCP anomalies
> per second on a single node, sustained over an hour (14k per hour). A TCP anomaly would
> be an escaped link layer packet that resulted in a TCP CRC failure, TCP packet out of sequence
> or TCP packet size error.
> According to this paper[*]:  http://tinyurl.com/3aue72r
> TCP's 16-bit CRC has an effective escape rate of about 2^-10; that is, 1 in 1024 errors may
> escape detection, and in fact what originally alerted us to this issue was seeing failures due
> to bit-errors in hadoop traffic. Extrapolating from that paper, one might expect 14 escaped
> packet errors per hour for that single node of a 4k cluster. While the above error rate
> was unusually high due to a broadband aggregate switch issue, hadoop not having an integrity
> check on RPC makes it problematic to discover and limit any potential data damage due to
> acting on a corrupt RPC message.
> ------
> [*] In case this jira outlives that tinyurl, the IEEE paper cited is:  "Performance of
> Checksums and CRCs over Real Data" by Jonathan Stone, Michael Greenwald, Craig Partridge,
> Jim Hughes.
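
As a rough illustration of the issue description above (not code from Hadoop or from this JIRA), the Python sketch below reproduces the extrapolation and shows one kind of corruption that slips past a 16-bit checksum. Note that the TCP field is a 16-bit ones' complement sum rather than a true CRC, so reordering aligned 16-bit words leaves it unchanged. The names `internet_checksum`, `frame`, `unframe` and `expected_escapes`, and the 4-byte CRC-32 trailer layout, are assumptions made for this example only; they are not a proposal for the actual RPC wire format.

```python
import struct
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style used by TCP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def frame(payload: bytes) -> bytes:
    """Hypothetical framing: payload followed by a big-endian CRC-32 trailer."""
    return payload + struct.pack("!I", zlib.crc32(payload))


def unframe(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the CRC-32 trailer does not match."""
    payload, trailer = message[:-4], message[-4:]
    (expected,) = struct.unpack("!I", trailer)
    if zlib.crc32(payload) != expected:
        raise ValueError("RPC payload failed CRC-32 check")
    return payload


def expected_escapes(anomalies_per_hour: float, escape_rate: float = 1 / 1024) -> float:
    """Expected undetected errors per hour, assuming 1 in 1024 anomalies escapes."""
    return anomalies_per_hour * escape_rate


if __name__ == "__main__":
    # 4 anomalies per second for an hour is 14400; 14400 / 1024 = 14.0625.
    print(expected_escapes(4 * 3600))

    good = b"AABBCCDD"
    bad = b"BBAACCDD"  # first two 16-bit words swapped in transit
    print(internet_checksum(good) == internet_checksum(bad))  # checksum cannot tell them apart
    print(zlib.crc32(good) == zlib.crc32(bad))                # CRC-32 values differ

    corrupted = bad + frame(good)[-4:]  # swapped payload with the original trailer
    try:
        unframe(corrupted)
    except ValueError as err:
        print("rejected:", err)
```

An application-level check like this runs end to end, so it also covers corruption introduced after the link layer and TCP have accepted a segment. The krb5i and krb5p options mentioned in the comment provide integrity protection that is cryptographic, which a plain CRC-32 is not.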
