mesos-issues mailing list archives

From "Ian Downes (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-4105) Network isolator causes corrupt packets to reach application
Date Wed, 09 Dec 2015 23:16:11 GMT

     [ https://issues.apache.org/jira/browse/MESOS-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ian Downes updated MESOS-4105:
------------------------------
    Description: 
The optional network isolator (network/port_mapping) will let corrupt TCP packets reach the
application. This could lead to data corruption in applications. Normally these packets are
dropped immediately by the network stack and do not reach the application. 

Networks may have a very low level of corrupt packets (a few per million), or they may have
very high levels if there are hardware or software errors in networking equipment.

The sequence of events is:
1) The host receives a corrupt packet from the external network
2) The hardware driver is able to checksum it and notices it has a bad checksum
3) The driver delivers the packet anyway, leaving it to the TCP layer to verify the checksum
again and then drop it
4) The packet is moved to a veth interface because it is destined for a container
5) Both sides of the veth pair have RX checksum offloading enabled by default
6) veth_xmit() marks the packet's checksum as UNNECESSARY because its peer device has RX
checksum offloading enabled
7) The packet is moved into the container TCP/IP stack
8) The TCP layer does not verify the checksum because it is marked as unnecessary
9) The packet is delivered to the application layer
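
The steps above can be illustrated with a toy model. This is only a sketch under stated
assumptions, not kernel code: the Packet class, the IpSummed enum, and the functions
veth_forward() and tcp_receive() are hypothetical names invented for the illustration, and
the checksum covers only the payload (real TCP checksums also cover the header and a
pseudo-header).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class IpSummed(Enum):
    """Simplified stand-in for the packet's checksum status flag (assumed model)."""
    NONE = 0          # not verified yet; the receiving stack must verify it
    UNNECESSARY = 1   # treated as already verified; the stack skips verification


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


@dataclass
class Packet:
    payload: bytes
    checksum: int
    ip_summed: IpSummed = IpSummed.NONE


def veth_forward(pkt: Packet, peer_rx_csum_offload: bool) -> Packet:
    """Hypothetical model of step 6: mark the checksum as unnecessary
    when the peer device advertises RX checksum offloading."""
    if pkt.ip_summed is IpSummed.NONE and peer_rx_csum_offload:
        pkt.ip_summed = IpSummed.UNNECESSARY
    return pkt


def tcp_receive(pkt: Packet) -> Optional[bytes]:
    """Hypothetical model of steps 8 and 9: verify in software only if needed.
    Returns the payload if delivered, or None if the packet is dropped."""
    if pkt.ip_summed is IpSummed.UNNECESSARY:
        return pkt.payload  # verification skipped
    if internet_checksum(pkt.payload) != pkt.checksum:
        return None  # bad checksum detected, packet dropped
    return pkt.payload


if __name__ == "__main__":
    original = b"hello world!"
    csum = internet_checksum(original)
    corrupted = b"hellO world!"  # one byte altered in transit

    # RX checksum offloading on: the corrupt payload reaches the application.
    print(tcp_receive(veth_forward(Packet(corrupted, csum), True)))
    # RX checksum offloading off: the software check drops the packet.
    print(tcp_receive(veth_forward(Packet(corrupted, csum), False)))
```

In this model the only difference between the two calls is whether the peer is considered
to have RX checksum offloading, which is enough to decide whether the corrupt payload is
delivered or dropped.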


  was:
The optional network isolator (network/port_mapping) will let corrupt TCP packets reach the
application. This could lead to data corruption in applications. Normally these packets are
dropped immediately by the network stack and do not reach the application. 

Networks may have a very low level of corrupt packets (a few per million), or they may have
very high levels if there are hardware or software errors in networking equipment.

Investigation is ongoing but an initial hypothesis is being tested:
1) The checksum error is correctly detected by the host interface.
2) The Mesos tc filters used by the network isolator redirect the packet to the virtual interface,
even when a checksum error has occurred.
3) Either in copying to the veth device or passing across the veth pipe the checksum flag
is cleared.
4) The veth inside the container does not verify the checksum, even though TCP RX checksum
offloading is supposedly on. [This is hypothesized to be acceptable normally because it's
receiving packets over the virtual link where corruption should not occur]
5) The container network stack accepts the packet and delivers it to the application.

Disabling TCP RX checksum offloading (CSO) on the container veth appears to fix this: it forces
the container network stack to compute the packet checksums in software, so it detects the
checksum errors and does not deliver the packet to the application.
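
The description does not say how offloading was disabled. One plausible way, shown here only
as a sketch, is the standard "ethtool -K <device> rx off" command run inside the container's
network namespace. The helper names below are hypothetical, the interface name "eth0" is an
assumption, and actually running the command requires ethtool and root privileges.

```python
import subprocess
from typing import List


def rx_csum_off_command(interface: str) -> List[str]:
    """Build the ethtool command that turns RX checksum offloading off."""
    return ["ethtool", "-K", interface, "rx", "off"]


def disable_rx_checksum_offload(interface: str, dry_run: bool = True) -> List[str]:
    """Return the command and, unless dry_run is set, execute it.

    Assumption: the caller is already inside the container's network
    namespace and has the privileges ethtool needs.
    """
    cmd = rx_csum_off_command(interface)
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    # "eth0" is a placeholder for the container-side veth device.
    print(" ".join(disable_rx_checksum_offload("eth0")))
```

The dry-run default keeps the sketch safe to run on a machine where changing offload
settings is not wanted.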


> Network isolator causes corrupt packets to reach application
> ------------------------------------------------------------
>
>                 Key: MESOS-4105
>                 URL: https://issues.apache.org/jira/browse/MESOS-4105
>             Project: Mesos
>          Issue Type: Bug
>          Components: isolation
>    Affects Versions: 0.20.0, 0.20.1, 0.21.0, 0.21.1, 0.21.2, 0.22.0, 0.22.1, 0.22.2, 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.25.0
>            Reporter: Ian Downes
>            Assignee: Cong Wang
>            Priority: Critical



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
