hadoop-common-issues mailing list archives

From "Aaron T. Myers (JIRA)" <j...@apache.org>
Subject [jira] [Moved] (HADOOP-7550) Need for Integrity Validation of RPC
Date Wed, 17 Aug 2011 20:45:27 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron T. Myers moved HDFS-2269 to HADOOP-7550:
----------------------------------------------

    Component/s:     (was: data-node)
                     (was: name-node)
                 ipc
            Key: HADOOP-7550  (was: HDFS-2269)
        Project: Hadoop Common  (was: Hadoop HDFS)

> Need for Integrity Validation of RPC
> ------------------------------------
>
>                 Key: HADOOP-7550
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7550
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>            Reporter: Dave Thompson
>
> Recent investigation of network packet corruption has shown a need for Hadoop RPC integrity validation beyond the assurances already provided by the 802.3 link-layer CRC and the 16-bit TCP checksum.
> During an unusual occurrence on a 4k-node cluster, we saw as many as 4 TCP anomalies per second on a single node, sustained over an hour (about 14k per hour). A TCP anomaly here is a packet that escaped link-layer detection and resulted in a TCP checksum failure, an out-of-sequence TCP packet, or a TCP packet size error.
> According to this paper[*]:  http://tinyurl.com/3aue72r
> TCP's 16-bit checksum has an effective detection strength of only about 10 bits: roughly 1 in 2^10 (1024) errors may escape detection. In fact, what originally alerted us to this issue was seeing failures due to bit errors in Hadoop traffic. Extrapolating from that paper, one might expect about 14 escaped packet errors per hour (14k / 1024) for that single node of a 4k-node cluster.
> While the above error rate was unusually high due to a broadband aggregate switch issue, the lack of an integrity check on Hadoop RPC makes it hard to discover corruption and to limit any potential data damage from acting on a corrupt RPC message.
> ------
> [*] In case this jira outlives that tinyurl, the IEEE paper cited is: "Performance of Checksums and CRCs over Real Data" by Jonathan Stone, Michael Greenwald, Craig Partridge, Jim Hughes.
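
The issue above asks for an application-level integrity check on RPC messages. As a rough illustration only, the sketch below appends a CRC-32 (from the standard java.util.zip.CRC32 class) to a payload and verifies it on receipt. The class name RpcIntegritySketch, the methods seal and open, and the frame layout are hypothetical and are not Hadoop's IPC code or wire format; the issue does not specify which checksum or framing should be used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical sketch: not part of Hadoop, names and framing are assumptions.
final class RpcIntegritySketch {

    // Assumed frame layout: [payload bytes][4-byte big-endian CRC-32 of payload].
    static byte[] seal(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + 4)
                .put(payload)
                .putInt((int) crc.getValue())
                .array();
    }

    // Recomputes the CRC-32 over the payload and rejects the frame on mismatch.
    static byte[] open(byte[] frame) {
        if (frame.length < 4) {
            throw new IllegalArgumentException("frame too short");
        }
        int payloadLength = frame.length - 4;
        CRC32 crc = new CRC32();
        crc.update(frame, 0, payloadLength);
        int expected = ByteBuffer.wrap(frame, payloadLength, 4).getInt();
        if ((int) crc.getValue() != expected) {
            throw new IllegalStateException("RPC checksum mismatch");
        }
        return Arrays.copyOf(frame, payloadLength);
    }

    public static void main(String[] args) {
        byte[] frame = seal("example call".getBytes(StandardCharsets.UTF_8));
        frame[0] ^= 0x01; // simulate a single flipped bit in transit
        try {
            open(frame);
            System.out.println("frame accepted");
        } catch (IllegalStateException e) {
            System.out.println("frame rejected: " + e.getMessage());
        }
    }
}
```

With a check of this kind, the receiver can drop or re-request a corrupt message instead of acting on it, and each rejected frame gives a direct count of errors that escaped the lower layers.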

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
