hadoop-common-issues mailing list archives

From "Dave Thompson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-7550) Need for Integrity Validation of RPC
Date Mon, 19 Sep 2011 16:49:09 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13107967#comment-13107967 ]

Dave Thompson commented on HADOOP-7550:

Yes, the ideal solution for this use case, given the performance considerations, would not be
a cryptographically secure checksum, as specified in the RFC 1964 Kerberos GSS-API SASL
implementation that Sun provides, but rather something along the lines of a CRC-32 that
covers the entire RPC.  I agree that an on/off mechanism should be included in the
implementation, both for the performance considerations you mentioned and because such a
data integrity check would be wastefully redundant for anyone who needs to deploy with a
secure SASL option such as QoP auth-conf or auth-int.
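To make the idea concrete, here is a minimal sketch of what a CRC-32 trailer over a serialized RPC frame could look like, using java.util.zip.CRC32.  The class and method names and the 4-byte-trailer layout are illustrative assumptions on my part, not the eventual Hadoop wire format:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Illustrative only: a CRC-32 trailer appended to a serialized RPC frame.
// The actual framing would be decided in the HADOOP-7550 implementation.
public class RpcCrcFraming {

    // Append a CRC-32 of the whole payload as a 4-byte big-endian trailer.
    public static byte[] frame(byte[] rpcPayload) {
        CRC32 crc = new CRC32();
        crc.update(rpcPayload, 0, rpcPayload.length);
        ByteBuffer out = ByteBuffer.allocate(rpcPayload.length + 4);
        out.put(rpcPayload);
        out.putInt((int) crc.getValue());   // low 32 bits of the checksum
        return out.array();
    }

    // Recompute the CRC on receipt; returns false if the frame is corrupt.
    public static boolean verify(byte[] framed) {
        if (framed.length < 4) {
            return false;
        }
        int bodyLen = framed.length - 4;
        CRC32 crc = new CRC32();
        crc.update(framed, 0, bodyLen);
        int expected = ByteBuffer.wrap(framed, bodyLen, 4).getInt();
        return (int) crc.getValue() == expected;
    }
}
```

A flipped bit anywhere in the payload or trailer then fails verify(), so the receiver can drop the message before acting on it; the on/off switch discussed above would simply skip both steps.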

> Need for Integrity Validation of RPC
> ------------------------------------
>                 Key: HADOOP-7550
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7550
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>            Reporter: Dave Thompson
>            Assignee: Dave Thompson
> Some recent investigation of network packet corruption has shown a need for Hadoop RPC
> integrity validation beyond the assurances already provided by the 802.3 link layer CRC
> and the TCP 16-bit checksum.
> During an unusual occurrence on a 4k node cluster, we've seen as high as 4 TCP anomalies
> per second on a single node, sustained over an hour (14k per hour).  A TCP anomaly would
> be an escaped link layer packet that resulted in a TCP checksum failure, a TCP packet out
> of sequence, or a TCP packet size error.
> According to this paper [*]:  http://tinyurl.com/3aue72r
> TCP's 16-bit checksum has an effective detection rate of 2^10: 1 in 1024 errors may escape
> detection, and in fact what originally alerted us to this issue was seeing failures due to
> bit-errors in Hadoop traffic.  Extrapolating from that paper, one might expect 14 escaped
> packet errors per hour on that single node of the 4k cluster.  While the above error rate
> was unusually high due to a broadband aggregate switch issue, Hadoop's lack of an integrity
> check on RPC makes such corruption hard to discover, and hard to contain any data damage
> caused by acting on a corrupt RPC message.
> ------
> [*] In case this jira outlives that tinyurl, the IEEE paper cited is: "Performance of
> Checksums and CRCs over Real Data" by Jonathan Stone, Michael Greenwald, Craig Partridge,
> and Jim Hughes.
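As a back-of-the-envelope check on the extrapolation in the description above (my arithmetic, not a figure from the paper):

```java
// Sanity-check of the escaped-error extrapolation quoted above.
public class EscapedErrorRate {
    public static void main(String[] args) {
        double anomaliesPerHour = 4 * 3600;        // 4 TCP anomalies/sec sustained -> 14,400/hr (~14k)
        double escapeProbability = 1.0 / 1024.0;   // ~1 in 2^10 errors escape the 16-bit checksum
        double escapedPerHour = anomaliesPerHour * escapeProbability;
        System.out.println(escapedPerHour);        // ~14 escaped packet errors per hour on that node
    }
}
```

Which matches the "14 escaped packet errors per hour" figure given for that single node.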

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

