cassandra-commits mailing list archives

From "Adam Horwich (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8620) Bootstrap session hanging indefinitely
Date Thu, 19 Mar 2015 09:44:39 GMT


Adam Horwich commented on CASSANDRA-8620:


We were suspicious of LCS edge-case bugs we'd seen reported elsewhere, so we created new tables
with the Size Tiered Compaction Strategy and have not seen the problem since. We have since
upgraded to 2.1.3, so we may re-evaluate LCS in the future.

> Bootstrap session hanging indefinitely
> --------------------------------------
>                 Key: CASSANDRA-8620
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Debian 7, Oracle JDK 1.7.0_51, AWS + GCE
>            Reporter: Adam Horwich
> Hi! We have been running a relatively small 2.1.2 cluster across 2 DCs for a few months,
with ~100GB load per node and RF=3, and over the last few weeks we have been trying to scale
up capacity.
> We've been recently seeing scenarios in which the Bootstrap or Unbootstrap streaming
process hangs indefinitely for one or more sessions on the receiver without stacktrace or
exception. This does not happen every time, and we do not get into this state with the same
sender every time. When the receiver is in a hung state, the following can be found in the
thread dump:
> The Stream-IN thread for one or more sessions is blocked in the following state: 
> Thread 24942: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Compiled frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Compiled frame)
>  - java.util.concurrent.ArrayBlockingQueue.take() @bci=20, line=374 (Compiled frame)
>  - @bci=31, line=89 (Compiled frame)
>  - @bci=4, line=337 (Compiled frame)
>  - org.apache.cassandra.utils.BytesReadTracker.readUnsignedShort() @bci=4, line=140 (Compiled frame)
>  - org.apache.cassandra.utils.ByteBufferUtil.readShortLength( @bci=1, line=317 (Compiled frame)
>  - org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength( @bci=2, line=327 (Compiled frame)
>  - org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize( @bci=5, line=397 (Compiled frame)
>  - org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize( @bci=2, line=381 (Compiled frame)
>  - org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(, org.apache.cassandra.db.ColumnSerializer$Flag, int,$Version) @bci=10, line=75 (Compiled frame)
>  - org.apache.cassandra.db.AbstractCell$1.computeNext() @bci=25, line=52 (Compiled frame)
>  - org.apache.cassandra.db.AbstractCell$1.computeNext() @bci=1, line=46 (Compiled frame)
>  - @bci=9, line=143 (Compiled frame)
>  - @bci=61, line=138 (Compiled frame)
>  -, @bci=320, line=283 (Compiled frame)
>  - org.apache.cassandra.streaming.StreamReader.writeRow(,, org.apache.cassandra.db.ColumnFamilyStore) @bci=26, line=157 (Compiled frame)
>  - @bci=258, line=89 (Compiled frame)
>  - org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=69, line=48 (Interpreted frame)
>  - org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=4, line=38 (Interpreted frame)
>  - org.apache.cassandra.streaming.messages.StreamMessage.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=37, line=55 (Interpreted frame)
>  - org.apache.cassandra.streaming.ConnectionHandler$ @bci=24, line=245 (Interpreted frame)
>  - @bci=11, line=744 (Interpreted frame)
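The parked frames at the top of the dump (Unsafe.park → LockSupport.park → ConditionObject.await → ArrayBlockingQueue.take) are what an unbounded take() on an empty queue looks like. A minimal sketch of that state, using plain java.util.concurrent classes rather than Cassandra's own (the class and thread names below are illustrative only):

```java
import java.util.concurrent.ArrayBlockingQueue;

public class ParkedReader {
    // Reproduces the parked state in the dump above: take() on an empty
    // ArrayBlockingQueue waits via LockSupport.park until data arrives.
    public static Thread.State readerStateOnEmptyQueue() throws InterruptedException {
        ArrayBlockingQueue<byte[]> dataBuffer = new ArrayBlockingQueue<>(16);
        Thread reader = new Thread(() -> {
            try {
                dataBuffer.take(); // never returns if the sender stops writing
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "Stream-IN-sketch");
        reader.setDaemon(true);
        reader.start();
        Thread.sleep(200); // give the reader time to park
        return reader.getState(); // WAITING: same frames as in the thread dump
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(readerStateOnEmptyQueue());
    }
}
```

Note that the forced-mode dump above reports the state as BLOCKED, while a live Thread.getState() on a parked thread returns WAITING; either way the thread makes no progress until data arrives or it is interrupted.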
> Debug logging shows that the receiver is still reading the file it is receiving from
the sender and has not yet sent an ACK.
> The receiver is waiting for more data to finish writing its row, and the sender is not
sending any more data. On both the receiver and sender there is a large amount of data (~5MB)
stuck in the Recv-Q (receiver) and Send-Q (sender). 
> We've been trying to diagnose this issue internally, but it's difficult to create a reliably
reproducible scenario. We have so far found that restarting all nodes in the cluster and ensuring
that a cleanup has been performed helps mitigate the problem (though a cleanup without restart
can still result in a hung state). However, it's unclear to me why either of these would affect
the streaming process in the way we've observed. One theory is that the calculated section
sizes are inaccurate.
> We tried setting a timeout on the dataBuffer reading (moving take to poll), which forced
a retry on the file, but the same transfer failed again until retries were exhausted.
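The take-to-poll change described above can be sketched as follows; the method and class names are hypothetical, not Cassandra's actual code:

```java
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

public class PollWithTimeout {
    // Sketch of swapping a blocking take() for poll() with a timeout: instead
    // of parking forever when the sender goes quiet, the read fails with an
    // IOException, which lets the session fail the file and trigger a retry.
    public static byte[] nextChunk(ArrayBlockingQueue<byte[]> dataBuffer,
                                   long timeoutMillis)
            throws IOException, InterruptedException {
        byte[] chunk = dataBuffer.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (chunk == null)
            throw new IOException("no data within " + timeoutMillis
                                  + " ms; failing stream so the file can be retried");
        return chunk;
    }
}
```

As observed above, this only converts the hang into a failure: if the underlying cause (the stalled socket and the data stuck in the kernel queues) persists, each retry of the same transfer fails the same way until retries are exhausted.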
> From a heap dump in the deadlocked state we confirmed the following:
> BytesReadTracker.bytesRead = 100477411
> Length of the section being read (which happens to be the last section): 100477411
> We also wrote all the data in the buffer to a text file and observed that the data in
the buffer ended halfway through a row.
> The stream reader checks the value of bytes read after importing each row (StreamReader
line 97) to detect the end of the section; however, it does not expect the data to finish
halfway through a row. This happens in OnDiskAtom.deserializeFromSSTable in any of the various
deserialization calls.
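A minimal sketch of the loop shape described above (the toy row format and all names are illustrative, not Cassandra's actual StreamReader code): the bytes-read check runs only between rows, so a stream that stops mid-row leaves the reader blocked inside row deserialization, where the check can never fire.

```java
import java.io.DataInputStream;
import java.io.IOException;

public class SectionReaderSketch {
    // The end-of-section check is evaluated once per completed row; a stream
    // that stalls mid-row blocks inside readRow() and never reaches it.
    public static int readSection(DataInputStream in, long sectionLength) throws IOException {
        long bytesRead = 0;
        int rows = 0;
        while (bytesRead < sectionLength) {   // per-row end-of-section check
            bytesRead += readRow(in);         // may block indefinitely mid-row
            rows++;
        }
        return rows;
    }

    // Toy row format: unsigned short length followed by that many payload bytes,
    // mirroring the readWithShortLength frames in the trace above.
    private static long readRow(DataInputStream in) throws IOException {
        int len = in.readUnsignedShort();
        byte[] payload = new byte[len];
        in.readFully(payload);                // a truncated row stalls here
        return 2 + len;
    }
}
```

With an in-memory stream a truncated row at least raises EOFException; on a socket stream that is still open but idle, the same read simply parks, which matches the hung state described in this report.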

This message was sent by Atlassian JIRA
