cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Robert Stupp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-6716) snapshots constantly fail with "Tried to hard link to file that does not exist"
Date Mon, 11 Apr 2016 13:44:25 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235102#comment-15235102
] 

Robert Stupp commented on CASSANDRA-6716:
-----------------------------------------

Oh - I think, I found at least some steps to produce a reference leak in 2.2.3. It’s somehow
a guide to “how to not manage C*” - sorry for that.
# on node0 start {{nodetool repair -pr some_keyspace}}
# somewhat later on node1 {{nodetool repair -pr some_keyspace}}
That repair shows some validation errors (concurrent repair message in log) but then appears
to be frozen (it’s not, {{nodetool compactionstats}} shows progress though)
# all nodes start anticompaction
# ctrl-c the repair on node0 - repair showing 100%
# on node1 {{nodetool scrub some_keyspace}}
# on node0 start another {{nodetool repair -pr some_keyspace}}
That repair shows some validation errors (concurrent repair message in log) but then appears
to be frozen, too (it’s not, {{nodetool compactionstats}} shows progress though)
# all nodes show a couple of these messages:
{code}
ERROR [Reference-Reaper:1] 2016-04-11 11:09:44,448 Ref.java:187 - LEAK DETECTED: a reference
(org.apache.cassandra.utils.concurrent.Ref$State@586b1193) to class org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier@1454175470:/var/lib/cassandra/data/some_keyspace/some_table-cb9b21d0d6e311e5a8e065ed6fa863b3/la-4129-big
was not released before the reference was garbage collected
{code}
# Got jstack output from this stage (ping me)
# node0 reported (first) repair as finished
# issued nodetool snapshot on node2 - worked fine
# issued nodetool snapshot on node1 - hangs for a long time in a loop logging {{ColumnFamilyStore.java:1896
- Spinning trying to capture all readers}} and {{ColumnFamilyStore.java:1895 - Spinning trying
to capture released readers}}. The bad thing is that both log messages appear multiple times
per millisecond, which is way too much.
# Got log files from this stage and the jstack of node1 (ping me)
# issued nodetool snapshot on node1 - hangs for a long time in a loop logging the same as
above. Both loops are probably caused by the concurrently running {{nodetool scrub}}
# Another issue (don’t know whether it’s related or not) is, that {{nodetool netstats}}
shows 3 or 4 pending large messages since the time when only the first repair has been started.
(Regular repairs were disabled on that cluster.)
# I looks like that any concurrent repair does not advance and will hang there forever

The three key points IMO are:
# The _spinning_ messages should not be logged that often
# The reference leak should not occur
# Duplicate repairs, although detected, seem to cause a hanging repair-job

It’s just a rough evaluation - so apologies if it’s not is super-exact.


> snapshots constantly fail with "Tried to hard link to file that does not exist"
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6716
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6716
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 2.0.5 (built from source), Linux, 6 nodes, JDK 1.7
>            Reporter: Nikolai Grigoriev
>         Attachments: system.log.gz
>
>
> It seems that since recently I have started getting a number of exceptions like "File
not found" on all Cassandra nodes. Currently I am getting an exception like this every couple
of seconds on each node, for different keyspaces and CFs.
> I have tried to restart the nodes, tried to scrub them. No luck so far. It seems that
scrub cannot complete on any of these nodes, at some point it fails because of the file that
it can't find.
> One one of the nodes currently the "nodetool scrub" command fails  instantly and consistently
with this exception:
> {code}
> # /opt/cassandra/bin/nodetool scrub 
> Exception in thread "main" java.lang.RuntimeException: Tried to hard link to file that
does not exist /mnt/disk5/cassandra/data/mykeyspace_jmeter/test_contacts/mykeyspace_jmeter-test_contacts-jb-28049-Data.db
> 	at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:75)
> 	at org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:1215)
> 	at org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1826)
> 	at org.apache.cassandra.db.ColumnFamilyStore.scrub(ColumnFamilyStore.java:1122)
> 	at org.apache.cassandra.service.StorageService.scrub(StorageService.java:2159)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:75)
> 	at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:279)
> 	at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
> 	at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
> 	at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
> 	at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
> 	at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
> 	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
> 	at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
> 	at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1487)
> 	at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:97)
> 	at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1328)
> 	at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1420)
> 	at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:848)
> 	at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
> 	at sun.rmi.transport.Transport$1.run(Transport.java:177)
> 	at sun.rmi.transport.Transport$1.run(Transport.java:174)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at sun.rmi.transport.Transport.serviceCall(Transport.java:173)
> 	at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:553)
> 	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:808)
> 	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:667)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:724)
> {code}
> Also I have noticed that the files that are missing are often (or maybe always?) referred
to in the log as follows:
> {quote}
>  WARN 00:06:10,597 At level 3, SSTableReader(path='/mnt/disk5/cassandra/data/mykeyspace_jmeter/test_contacts/mykeyspace_jmeter-test_contacts-jb-26776-Data.db')
[DecoratedKey(-9053060597280257896, 0010f582cddaca974d7198ae30f194ccfd0c00001000000000004b818d00000000000000010000100000000000004000000000000000000300),
DecoratedKey(-8855915848970248008, 00103ce153dfeeb547fb881a51adf611f6cf0000100000000000f04f4700000000000000010000100000000000004000000000000000000500)]
overlaps SSTableReader(path='/mnt/disk2/cassandra/data/mykeyspace_jmeter/test_contacts/mykeyspace_jmeter-test_contacts-jb-28022-Data.db')
[DecoratedKey(-8964446543595889729, 001043214a8bdcfd46a3b8ea71da2d57bb9a0000100000000001117c0d00000000000000000000100000000000004000000000000000000100),
DecoratedKey(-8848132752710859808, 0010d1f6de8039d54218bf5b1e184335df5f000010000000000062e52600000000000000010000100000000000004000000000000000000400)].
 This could be caused by a bug in Cassandra 1.1.0 .. 1.1.3 or due to the fact that you have
dropped sstables from another node into the data directory. Sending back to L0.  If you didn't
drop in sstables, and have not yet run scrub, you should do so since you may also have rows
out-of-order within an sstable
>  WARN [RMI TCP Connection(2)-10.3.45.158] 2014-02-18 00:06:10,597 LeveledManifest.java
(line 171) At level 3, SSTableReader(path='/mnt/disk5/cassandra/data/mykeyspace_jmeter/test_contacts/mykeyspace_jmeter-test_contacts-jb-26776-Data.db')
[DecoratedKey(-9053060597280257896, 0010f582cddaca974d7198ae30f194ccfd0c00001000000000004b818d00000000000000010000100000000000004000000000000000000300),
DecoratedKey(-8855915848970248008, 00103ce153dfeeb547fb881a51adf611f6cf0000100000000000f04f4700000000000000010000100000000000004000000000000000000500)]
overlaps SSTableReader(path='/mnt/disk2/cassandra/data/mykeyspace_jmeter/test_contacts/mykeyspace_jmeter-test_contacts-jb-28022-Data.db')
[DecoratedKey(-8964446543595889729, 001043214a8bdcfd46a3b8ea71da2d57bb9a0000100000000001117c0d00000000000000000000100000000000004000000000000000000100),
DecoratedKey(-8848132752710859808, 0010d1f6de8039d54218bf5b1e184335df5f000010000000000062e52600000000000000010000100000000000004000000000000000000400)].
 This could be caused by a bug in Cassandra 1.1.0 .. 1.1.3 or due to the fact that you have
dropped sstables from another node into the data directory. Sending back to L0.  If you didn't
drop in sstables, and have not yet run scrub, you should do so since you may also have rows
out-of-order within an sstable
> {quote}
> I never had anything but Cassandra 2.0 on these systems. Also I have recreated my test
data from scratch with 2.0.4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message