cassandra-commits mailing list archives

From "Peter Schuller (Commented) (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-3838) Repair Streaming hangs between multiple regions
Date Sun, 05 Feb 2012 01:24:54 GMT


Peter Schuller commented on CASSANDRA-3838:

Let me be more clear about why keep-alive is better.

TCP keep-alive is at the transport level, and thus independent of in-band data (or lack thereof).
Imagine that you're implementing a remote procedure call protocol where the client sends:

INVOKE name-of-process arg1 arg2

The server invokes the method, and responds:

RET success|failure exit-value|exception

The first thing you need, if you are using this in any kind of production scenario, is to
ensure that requests can time out. But there is a problem. Suppose you assume that this
software runs on well-connected networks at a high number of requests per second; there is
no reason not to time out requests quickly if the remote host is unreachable.
So you set a socket timeout of 1 second. The problem is that every request that takes longer
than 1 second will also time out, even though the method call legitimately took that long.

The conflict happens because the selection of timeout was made based on the transport level
circumstances (fast local network, high throughput, no need to wait if a host is down) while
the effect of the timeout is at the in-band data level and is thus triggered by a slow request.
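
To make the conflict concrete, here is a minimal Java sketch. It is not taken from Cassandra
or from the attached patch; it assumes a toy line-based version of the INVOKE/RET protocol
above running over loopback, and the names ReadTimeoutDemo, invoke and slowMethod, as well
as the return value 42, are invented for illustration. The transport is healthy in both
cases; only the server's processing time differs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ReadTimeoutDemo {
    /**
     * Sends one INVOKE to a loopback server that takes serverDelayMs to answer,
     * with the client's socket read timeout set to readTimeoutMs.
     * Returns the RET line, or "TIMEOUT" if the read timed out first.
     */
    static String invoke(int serverDelayMs, int readTimeoutMs) throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread t = new Thread(() -> {
                try (Socket s = server.accept();
                     BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    in.readLine();                 // consume the INVOKE line
                    Thread.sleep(serverDelayMs);   // a legitimately slow method call
                    out.println("RET success 42");
                } catch (IOException | InterruptedException e) {
                    // The client may already have given up; ignored in this sketch.
                }
            });
            t.start();
            try (Socket c = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
                // Timeout chosen for transport reasons, but it fires on any slow reply.
                c.setSoTimeout(readTimeoutMs);
                PrintWriter out = new PrintWriter(c.getOutputStream(), true);
                BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()));
                out.println("INVOKE slowMethod arg1 arg2");
                try {
                    return in.readLine();
                } catch (SocketTimeoutException e) {
                    return "TIMEOUT";
                }
            } finally {
                t.join();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(invoke(50, 2000));   // fast call, generous timeout
        System.out.println(invoke(500, 100));   // slow call, tight timeout
    }
}
```

The second call should report TIMEOUT even though the host is reachable and nothing is
wrong with the network, which is exactly the false positive described above.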

One way to fix this is to extend the protocol between client and server such that they can
constantly be exchanging PING/PONG type messages (witness IRC for an example of this). This
allows you to utilize socket (or read/write op) timeouts to detect a broken transport, under
the assumption/premise that both sides have dedicated code for the ping/pong stuff which is
independent of any delay in processing the otherwise in-band data.

Disadvantages of this approach can include the need to actually change the protocol, and (depending
on implementation) additional implementation complexity as you suddenly need to actively model
the transport as such.
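
As a rough illustration of what the in-band approach costs, here is a minimal Java sketch
under some loud assumptions: it is a simplified, one-directional heartbeat in which the
server emits unsolicited PING lines while the method runs (a real protocol would have both
sides pinging and answering with PONG), and the names HeartbeatDemo, invoke, serve and
slowMethod are invented for the example. Note how both sides need dedicated code that
knows about the heartbeat lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class HeartbeatDemo {
    /** Returns the RET line, or "TIMEOUT" if nothing at all arrived within readTimeoutMs. */
    static String invoke(int serverDelayMs, int heartbeatMs, int readTimeoutMs) throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread t = new Thread(() -> serve(server, serverDelayMs, heartbeatMs));
            t.start();
            try (Socket c = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort())) {
                c.setSoTimeout(readTimeoutMs);   // now only bounds silence on the wire
                PrintWriter out = new PrintWriter(c.getOutputStream(), true);
                BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()));
                out.println("INVOKE slowMethod arg1 arg2");
                try {
                    String line;
                    while ((line = in.readLine()) != null) {
                        if (!line.equals("PING")) {
                            return line;         // the actual in-band reply
                        }
                        // A PING only proves the transport is alive; keep waiting.
                    }
                    return "EOF";
                } catch (SocketTimeoutException e) {
                    return "TIMEOUT";
                }
            } finally {
                t.join();
            }
        }
    }

    /** Server side: heartbeats run independently of the slow method call. */
    static void serve(ServerSocket server, int delayMs, int heartbeatMs) {
        ScheduledExecutorService beat = Executors.newSingleThreadScheduledExecutor();
        try (Socket s = server.accept();
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            in.readLine();                       // consume the INVOKE line
            if (heartbeatMs > 0) {               // heartbeatMs <= 0 disables heartbeats
                beat.scheduleAtFixedRate(() -> {
                    synchronized (out) { out.println("PING"); }
                }, heartbeatMs, heartbeatMs, TimeUnit.MILLISECONDS);
            }
            Thread.sleep(delayMs);               // the slow method call
            beat.shutdownNow();
            synchronized (out) { out.println("RET success 42"); }
        } catch (IOException | InterruptedException e) {
            // The client may already have gone away; ignored in this sketch.
        } finally {
            beat.shutdownNow();
        }
    }
}
```

With heartbeats enabled, invoke(600, 100, 400) should return the RET line despite the read
timeout being shorter than the call; with heartbeats disabled, invoke(600, 0, 400) should
return TIMEOUT. The price is the extra thread, the locking around writes, and a client
loop that has to filter protocol noise from payload.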

TCP keep-alive is a way to let the kernel/tcp, which is already supposed to support this,
deal with this without adding complexity to the application. It allows what effectively boils
down to a "timeout" at the transport level which can be selected based on use-case and expected
networking characteristics, and is independent of the nature of the in-band data sent over
that transport.
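
For comparison, here is a minimal Java sketch of the keep-alive route. This is an
illustration only, not the change proposed for Cassandra; the class and method names are
invented. It uses only java.net.Socket.setKeepAlive(), which turns on SO_KEEPALIVE; how long
the connection must be idle before probes are sent, and how many probes are tried, is left
to the operating system (on Linux the idle time defaults to two hours via
net.ipv4.tcp_keepalive_time, so it would normally need tuning for this use case).

```java
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class KeepAliveDemo {
    /** Opens a loopback connection, enables SO_KEEPALIVE, and reports whether it is set. */
    static boolean connectWithKeepAlive() throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // Ask the kernel to probe the peer when the connection is idle.
            // No read timeout is set, so a slow reply alone never tears the connection down.
            client.setKeepAlive(true);
            return client.getKeepAlive();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("SO_KEEPALIVE enabled: " + connectWithKeepAlive());
    }
}
```

The application code is one line, and nothing in the wire protocol changes. The trade-off
is that the timing knobs live outside the application, in OS settings or in
platform-specific socket options where the runtime exposes them.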

In the Cassandra case, the equivalent of the slow RPC call might be that a write() during
streaming blocks for 5 seconds because socket buffers on both ends are full, and the other
end is doing a GC or waiting on an fsync().

By using keep-alives we get more "correct" behavior in that such blocks won't cause connection
tear-downs, while at the same time not having to change the protocol and/or add complexity
to the code base to implement a protocol-within-tcp in which to mux the actual payload for

> Repair Streaming hangs between multiple regions
> -----------------------------------------------
>                 Key: CASSANDRA-3838
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.7
>            Reporter: Vijay
>            Assignee: Vijay
>            Priority: Minor
>             Fix For: 1.0.8
>         Attachments: 0001-Add-streaming-socket-timeouts.patch
> Streaming hangs between datacenters, though there might be multiple reasons for this,
> a simple fix will be to add the Socket timeout so the session can retry.
> The following is the netstat of the affected node (the below output remains this way
> for a very long period).
> [test_abrepairtest@test_abrepair--euwest1c-i-1adfb753 ~]$ nt netstats
> Mode: NORMAL
> Streaming to: /
>    /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2221-Data.db sections=7002 progress=1523325354/2475291786 - 61%
>    /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2233-Data.db sections=4581 progress=0/595026085 - 0%
>    /mnt/data/cassandra070/data/abtests/cust_allocs-g-2235-Data.db sections=6631 progress=0/2270344837 - 0%
>    /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2239-Data.db sections=6266 progress=0/2190197091 - 0%
>    /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2230-Data.db sections=7662 progress=0/3082087770 - 0%
>    /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2240-Data.db sections=7874 progress=0/587439833 - 0%
>    /mnt/data/cassandra070/data/abtests/cust_allocs-g-2226-Data.db sections=7682 progress=0/2933920085 - 0%
> "Streaming:1" daemon prio=10 tid=0x00002aaac2060800 nid=0x1676 runnable [0x000000006be85000]
>    java.lang.Thread.State: RUNNABLE
>         at Method)
>         at
>         at
>         at
>         at
>         at
>         at
>         at
>         - locked <0x00000006afea1bd8> (a
>         at com.ning.compress.lzf.ChunkEncoder.encodeAndWriteChunk(
>         at com.ning.compress.lzf.LZFOutputStream.writeCompressedBlock(
>         at com.ning.compress.lzf.LZFOutputStream.flush(
>         at
>         at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(
>         at
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(
>         at java.util.concurrent.ThreadPoolExecutor$
>         at
> Streaming from: /
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2241-Data.db sections=7231 progress=0/1548922508 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2231-Data.db sections=4730 progress=0/296474156 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2244-Data.db sections=7650 progress=0/1580417610 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2217-Data.db sections=7682 progress=0/196689250 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2220-Data.db sections=7149 progress=0/478695185 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2171-Data.db sections=443 progress=0/78417320 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-g-2235-Data.db sections=6631 progress=0/2270344837 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2222-Data.db sections=4590 progress=0/1310718798 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2233-Data.db sections=4581 progress=0/595026085 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-g-2226-Data.db sections=7682 progress=0/2933920085 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2213-Data.db sections=7876 progress=0/3308781588 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2216-Data.db sections=7386 progress=0/2868167170 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2240-Data.db sections=7874 progress=0/587439833 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2254-Data.db sections=4618 progress=0/215989758 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2221-Data.db sections=7002 progress=1542191546/2475291786 - 62%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2239-Data.db sections=6266 progress=0/2190197091 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2210-Data.db sections=6698 progress=0/2304563183 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2230-Data.db sections=7662 progress=0/3082087770 - 0%
>    abtests: /mnt/data/cassandra070/data/abtests/cust_allocs-hc-2229-Data.db sections=7386 progress=0/1324787539 - 0%
> "Thread-198896" prio=10 tid=0x00002aaac0e00800 nid=0x4710 runnable [0x000000004251b000]
>    java.lang.Thread.State: RUNNABLE
>         at Method)
>         at
>         at
>         at
>         at
>         at
>         - locked <0x00000005e220a170> (a java.lang.Object)
>         at
>         at
>         - locked <0x00000005e220a1b8> (a
>         at com.ning.compress.lzf.LZFDecoder.readFully(
>         at com.ning.compress.lzf.LZFDecoder.decompressChunk(
>         at com.ning.compress.lzf.LZFInputStream.readyBuffer(
>         at
>         at
>         at
>         at org.apache.cassandra.utils.BytesReadTracker.readLong(
>         at org.apache.cassandra.db.ColumnSerializer.deserialize(
>         at org.apache.cassandra.db.ColumnSerializer.deserialize(
>         at
>         at org.apache.cassandra.streaming.IncomingStreamReader.streamIn(
>         at
>         at
>         at


