cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Peter Schuller (JIRA)" <j...@apache.org>
Subject [jira] Created: (CASSANDRA-1597) cassandra start-up seek-bound in uninterruptable sleep
Date Sat, 09 Oct 2010 11:57:31 GMT
cassandra start-up seek-bound in uninterruptable sleep
------------------------------------------------------

                 Key: CASSANDRA-1597
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1597
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Peter Schuller
            Priority: Minor


(Cassandra trunk from a few months ago, Linux 2.6.32)

I just restarted a Cassandra node with a large (a few TB:s) column family. I was expecting
index sampling to take a while, but in this case I'm seeing something unexpected. First, the
log has:

10/10/09 11:15:49 INFO config.DatabaseDescriptor: Auto DiskAccessMode determined to be mmap
10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for /var/lib/cassandra/data/system/Schema-c-1-<>
10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for /var/lib/cassandra/data/system/Migrations-c-1-<>
10/10/09 11:15:50 INFO sstable.SSTableReader: Sampling index for /var/lib/cassandra/data/system/LocationInfo-c-1-<>
10/10/09 11:15:50 INFO config.DatabaseDescriptor: Loading schema version d8ad1485-a6e4-11df-beb1-d1273d6ae3d0

And it is currently stuck with iostat looking like this:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await
 svctm  %util
sdb               0.00    67.00  146.00    2.00   584.00   276.00    11.62     0.99    6.68
  6.68  98.80

I re-checked the code and the index sampling really should not be seek bound since it is sequential.
At first I figured that perhaps some aspect of the I/O pattern is causing the kernel (and
I guess underlying RAID+drives) to not do read-ahead. This could explain why lots of small
reads, even though they are sequential in nature, could end up being seek bound.

mmap() has the problem that you cannot strace to see what is doing I/O, so I wanted to confirm
with jstack. At this point I notice jstack won't attach at all (after a timeout).

Further, a SIGKILL fails to kill Java, with iostat proceeding, still seek bound, and the process
in uninterruptable sleep (presumably constantly, or else the SIGKILL should have worked).

Somewhere along the way, strace has begun failing to attach as well but from my remembrance
I didn't see any I/O calls when it did attach properly, which is consistent with whatever
I/O is happening being triggered by mmap():ed memory access or else by some other long-running
syscall.

A very speculative hypothesis is that perhaps the initial mmap() call can be long and blocking,
seek-bound, due to inodes being consulted on disk (compare with the cost of "rm:ing" large
files on ext3fs). This would probably have the potential to take quite a long time, if it
is the case. On the other hand, if this was the behavior of mmap() on ext3fs, it seems unlikely
that it would not be widely known already. It also seems unlikely in terms of prior probability
since I don't see a good reason why mmap() would have a need to scan inodes.

I am currently waiting it out to see if/when it eventually gets killed.

I will then try restarting the node in standard I/O mode and see if that works. If it does,
I'll re-try with mmap():ed mode to see that I can trigger it again.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message