hadoop-hdfs-dev mailing list archives

From "Dave Thompson (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-2269) Need for Integrity Validation of RPC
Date Wed, 17 Aug 2011 19:33:27 GMT
Need for Integrity Validation of RPC

                 Key: HDFS-2269
                 URL: https://issues.apache.org/jira/browse/HDFS-2269
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: data-node, name-node
            Reporter: Dave Thompson

Some recent investigation of network packet corruption has shown a need for Hadoop RPC integrity
validation beyond the assurances already provided by the 802.3 link-layer CRC and the TCP 16-bit checksum.

During an unusual occurrence on a 4k-node cluster, we've seen as many as 4 TCP anomalies per
second on a single node, sustained over an hour (~14k per hour). A TCP anomaly here is an
escaped link-layer packet that results in a TCP checksum failure, an out-of-sequence TCP packet,
or a TCP packet size error.

According to this paper[*]:  http://tinyurl.com/3aue72r
TCP's 16-bit checksum has an effective detection rate of only about 1 in 2^10: roughly 1 in 1024
corrupted packets may escape detection. In fact, what originally alerted us to this issue was
seeing failures due to bit errors in Hadoop traffic. Extrapolating from that paper, one might
expect about 14 escaped packet errors per hour for that single node of a 4k cluster (14k anomalies
per hour / 1024). While the above error rate was unusually high due to an aggregation switch issue,
the absence of an integrity check on Hadoop RPC makes it difficult to discover such corruption,
and to limit any potential data damage from acting on a corrupt RPC message.

[*] In case this jira outlives that tinyurl, the IEEE paper cited is:  "Performance of Checksums
and CRCs over Real Data" by Jonathan Stone, Michael Greenwald, Craig Partridge, Jim Hughes.
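One possible shape for the proposed improvement, as a minimal sketch only (the class and method names are hypothetical, not part of Hadoop's actual RPC code): the sender appends an application-level CRC32 over the serialized RPC payload, and the receiver recomputes and compares before acting on the message, so corruption that slips past the TCP 16-bit checksum is still caught.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch of application-level RPC integrity validation.
// CRC32 is illustrative; a stronger digest could be substituted.
public class RpcChecksum {

    // Sender side: compute a checksum over the serialized payload.
    static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    // Receiver side: recompute and compare before trusting the message.
    static boolean verify(byte[] payload, long expected) {
        return checksum(payload) == expected;
    }

    public static void main(String[] args) {
        byte[] msg = "getBlockLocations /user/data".getBytes(StandardCharsets.UTF_8);
        long sent = checksum(msg);

        // Uncorrupted payload verifies.
        System.out.println(verify(msg, sent)); // true

        // Flip a single bit to simulate an escaped link-layer error;
        // the application-level check catches it.
        msg[3] ^= 0x01;
        System.out.println(verify(msg, sent)); // false
    }
}
```

The cost is one extra pass over each payload plus a few bytes per message, which seems small next to the cost of acting on a corrupt RPC.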

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

