cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sameer Farooqui <cassandral...@gmail.com>
Subject Re: Repair fails with java.io.IOError: java.io.EOFException
Date Tue, 26 Jul 2011 22:10:12 GMT
Thanks for the info guys.

I'm running compaction on the two very highly loaded nodes now in hopes of
the data volume going down. But I'm skeptical because I don't see how it got
so unbalanced in the first place (all nodes were up while the writes were
being injected).

I should have an update tomorrow on whether compaction rebalanced the nodes.

The tokens are evenly distributed across the ring:

Address DC Rack Status State Load Owns Token
148873535527910577765226390751398592512
10.192.143.x DC1 RAC1 Up Normal 643.42 GB 12.50% 0
10.192.171.x DC1 RAC1 Up Normal 128.96 GB 6.25%
21267647932558653966460912964485513216
10.210.95.x DC1 RAC1 Up Normal 128.34 GB 12.50%
42535295865117307932921825928971026432
10.211.19.x DC1 RAC1 Up Normal 128.55 GB 6.25%
63802943797675961899382738893456539648
10.68.58.x DC1 RAC2 Up Normal 643.05 GB 12.50%
85070591730234615865843651857942052864
10.110.31.x DC1 RAC2 Up Normal 128.84 GB 6.25%
106338239662793269832304564822427566080
10.96.58.x DC1 RAC2 Up Normal 128.11 GB 12.50%
127605887595351923798765477786913079296
10.210.195.x DC1 RAC2 Up Normal 129.33 GB 6.25%
148873535527910577765226390751398592512
10.114.138.x DC2 RAC1 Up Normal 258.04 GB 6.25%
10633823966279326983230456482242756608
10.203.79.x DC2 RAC1 Up Normal 257.14 GB 6.25%
53169119831396634916152282411213783040
10.242.209.x DC2 RAC1 Up Normal 256.58 GB 6.25%
95704415696513942849074108340184809472
10.38.25.x DC2 RAC1 Up Normal 257.08 GB 6.25%
138239711561631250781995934269155835904

On Tue, Jul 26, 2011 at 1:59 AM, aaron morton <aaron@thelastpickle.com>wrote:

> Was guessing something like a token move may have happened in the past.
>
> Good suggestion to also kick off a major compaction. I've seen that make a
> big difference even for apps that do not do deletes, but do do overwrites.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 26 Jul 2011, at 19:00, Sylvain Lebresne wrote:
>
> >> If they are and repair has completed use node tool cleanup to remove the
> >> data the node is no longer responsible. See bootstrap section above.
> >
> > I've seen that said a few times so allow me to correct. Cleanup is
> useless after
> > a repair. 'nodetool cleanup' removes rows the node is not responsible
> anymore
> > and is thus useful only after operations that change the range a node is
> > responsible for (bootstrap, move, decommission). After a repair, you will
> need
> > compaction to kick in to see you disk usage come back to normal.
> >
> > --
> > Sylvain
> >
> >> Hope that helps.
> >>
> >> -----------------
> >> Aaron Morton
> >> Freelance Cassandra Developer
> >> @aaronmorton
> >> http://www.thelastpickle.com
> >> On 26 Jul 2011, at 12:44, Sameer Farooqui wrote:
> >>
> >> Looks like the repair finished successfully the second time. However,
> the
> >> cluster is still severely unbalanced. I was hoping the repair would
> balance
> >> the nodes. We're using random partitioner. One node has 900GB and others
> >> have 128GB, 191GB, 129GB, 257 GB, etc. The 900GB and the 646GB are just
> >> insanely high. Not sure why or how to troubleshoot.
> >>
> >>
> >>
> >> On Fri, Jul 22, 2011 at 1:28 PM, Sameer Farooqui <
> cassandralabs@gmail.com>
> >> wrote:
> >>>
> >>> I don't see a JVM crashlog ( hs_err_pid[pid].log) in
> >>> ~/brisk/resources/cassandra/bin or /tmp. So maybe JVM didn't crash?
> >>>
> >>> We're running a pretty up to date with Sun Java:
> >>>
> >>> ubuntu@ip-10-2-x-x:/tmp$ java -version
> >>> java version "1.6.0_24"
> >>> Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
> >>> Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
> >>>
> >>> I'm gonna restart the Repair process in a few more hours. If there are
> any
> >>> additional debug or troubleshooting logs you'd like me to enable first,
> >>> please let me know.
> >>>
> >>> - Sameer
> >>>
> >>>
> >>>
> >>> On Thu, Jul 21, 2011 at 5:31 PM, Jonathan Ellis <jbellis@gmail.com>
> wrote:
> >>>>
> >>>> Did you check for a JVM crash log?
> >>>>
> >>>> You should make sure you're running the latest Sun JVM, older versions
> >>>> and OpenJDK in particular are prone to segfaulting.
> >>>>
> >>>> On Thu, Jul 21, 2011 at 6:53 PM, Sameer Farooqui
> >>>> <cassandralabs@gmail.com> wrote:
> >>>>> We are starting Cassandra with "brisk cassandra", so as a stand-alone
> >>>>> process, not a service.
> >>>>>
> >>>>> The syslog on the node doesn't show anything regarding the Cassandra
> >>>>> Java
> >>>>> process around the time the last entries were made in the Cassandra
> >>>>> system.log (2011-07-21 13:01:51):
> >>>>>
> >>>>> Jul 21 12:35:01 ip-10-2-206-127 CRON[12826]: (root) CMD (command
-v
> >>>>> debian-sa1 > /dev/null && debian-sa1 1 1)
> >>>>> Jul 21 12:45:01 ip-10-2-206-127 CRON[13420]: (root) CMD (command
-v
> >>>>> debian-sa1 > /dev/null && debian-sa1 1 1)
> >>>>> Jul 21 12:55:01 ip-10-2-206-127 CRON[14021]: (root) CMD (command
-v
> >>>>> debian-sa1 > /dev/null && debian-sa1 1 1)
> >>>>> Jul 21 14:26:07 ip-10-2-206-127 kernel: imklog 4.2.0, log source
=
> >>>>> /proc/kmsg started.
> >>>>> Jul 21 14:26:07 ip-10-2-206-127 rsyslogd: [origin software="rsyslogd"
> >>>>> swVersion="4.2.0" x-pid="663" x-info="http://www.rsyslog.com"]
> >>>>> (re)start
> >>>>>
> >>>>>
> >>>>> The last thing in the Cassandra log before INFO Logging initialized
> is:
> >>>>>
> >>>>>  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java
> (line
> >>>>> 128)
> >>>>> GC for ParNew: 202 ms, 153219160 reclaimed leaving 2040879600 used;
> max
> >>>>> is
> >>>>> 4030726144
> >>>>>
> >>>>>
> >>>>> I can start Repair again, but am worried that it will crash Cassandra
> >>>>> again,
> >>>>> so I want to turn on any debugging or helpful logs to diagnose the
> >>>>> crash if
> >>>>> it happens again.
> >>>>>
> >>>>>
> >>>>> - Sameer
> >>>>>
> >>>>>
> >>>>> On Thu, Jul 21, 2011 at 4:30 PM, aaron morton <
> aaron@thelastpickle.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> The default init.d script will direct std out/err to that file,
how
> >>>>>> are
> >>>>>> you starting brisk / cassandra ?
> >>>>>> Check the syslog and other logs in /var/log to see if the OS
killed
> >>>>>> cassandra.
> >>>>>> Also, what was the last thing in the casandra log before INFO
[main]
> >>>>>> 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java (line 78)
> Logging
> >>>>>> initialised ?
> >>>>>>
> >>>>>> Cheers
> >>>>>>
> >>>>>> -----------------
> >>>>>> Aaron Morton
> >>>>>> Freelance Cassandra Developer
> >>>>>> @aaronmorton
> >>>>>> http://www.thelastpickle.com
> >>>>>> On 22 Jul 2011, at 10:50, Sameer Farooqui wrote:
> >>>>>>
> >>>>>> Hey Aaron,
> >>>>>>
> >>>>>> I don't have any output.log files in that folder:
> >>>>>>
> >>>>>> ubuntu@ip-10-2-x-x:~$ cd /var/log/cassandra
> >>>>>> ubuntu@ip-10-2-x-x:/var/log/cassandra$ ls
> >>>>>> system.log     system.log.11  system.log.4  system.log.7
> >>>>>> system.log.1   system.log.2   system.log.5  system.log.8
> >>>>>> system.log.10  system.log.3   system.log.6  system.log.9
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Jul 21, 2011 at 3:40 PM, aaron morton
> >>>>>> <aaron@thelastpickle.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Check /var/log/cassandra/output.log (assuming the default
init
> >>>>>>> scripts)
> >>>>>>> A
> >>>>>>> -----------------
> >>>>>>> Aaron Morton
> >>>>>>> Freelance Cassandra Developer
> >>>>>>> @aaronmorton
> >>>>>>> http://www.thelastpickle.com
> >>>>>>> On 22 Jul 2011, at 10:13, Sameer Farooqui wrote:
> >>>>>>>
> >>>>>>> Hmm. Just looked at the log more closely.
> >>>>>>>
> >>>>>>> So, what actually happened is while Repair was running on
this
> >>>>>>> specific
> >>>>>>> node, the Cassandra java process terminated itself automatically.
> The
> >>>>>>> last
> >>>>>>> entries in the log are:
> >>>>>>>
> >>>>>>>  INFO [ScheduledTasks:1] 2011-07-21 13:00:20,285 GCInspector.java
> >>>>>>> (line
> >>>>>>> 128) GC for ParNew: 214 ms, 162748656 reclaimed leaving
1845274888
> >>>>>>> used; max
> >>>>>>> is 4030726144
> >>>>>>>  INFO [ScheduledTasks:1] 2011-07-21 13:00:27,375 GCInspector.java
> >>>>>>> (line
> >>>>>>> 128) GC for ParNew: 266 ms, 158835624 reclaimed leaving
1864471688
> >>>>>>> used; max
> >>>>>>> is 4030726144
> >>>>>>>  INFO [ScheduledTasks:1] 2011-07-21 13:00:57,658 GCInspector.java
> >>>>>>> (line
> >>>>>>> 128) GC for ParNew: 251 ms, 148861328 reclaimed leaving
1931111120
> >>>>>>> used; max
> >>>>>>> is 4030726144
> >>>>>>>  INFO [ScheduledTasks:1] 2011-07-21 13:01:19,358 GCInspector.java
> >>>>>>> (line
> >>>>>>> 128) GC for ParNew: 260 ms, 157638152 reclaimed leaving
1955746368
> >>>>>>> used; max
> >>>>>>> is 4030726144
> >>>>>>>  INFO [ScheduledTasks:1] 2011-07-21 13:01:22,729 GCInspector.java
> >>>>>>> (line
> >>>>>>> 128) GC for ParNew: 325 ms, 154157352 reclaimed leaving
1969361176
> >>>>>>> used; max
> >>>>>>> is 4030726144
> >>>>>>>  INFO [ScheduledTasks:1] 2011-07-21 13:01:51,187 GCInspector.java
> >>>>>>> (line
> >>>>>>> 128) GC for ParNew: 202 ms, 153219160 reclaimed leaving
2040879600
> >>>>>>> used; max
> >>>>>>> is 4030726144
> >>>>>>>
> >>>>>>> When we came in this morning, nodetool ring from another
node
> showed
> >>>>>>> the
> >>>>>>> 1st node as down and OpsCenter also reported it as down.
> >>>>>>>
> >>>>>>> Next we ran "sudo netstat -anp | grep 7199" from the 1st
node to
> see
> >>>>>>> the
> >>>>>>> status of the Cassandra PID and it was not running.
> >>>>>>>
> >>>>>>> We then started Cassandra:
> >>>>>>>
> >>>>>>> INFO [main] 2011-07-21 15:48:07,233 AbstractCassandraDaemon.java
> >>>>>>> (line
> >>>>>>> 78) Logging initialized
> >>>>>>>  INFO [main] 2011-07-21 15:48:07,266 AbstractCassandraDaemon.java
> >>>>>>> (line
> >>>>>>> 96) Heap size: 3894411264/3894411264
> >>>>>>>  INFO [main] 2011-07-21 15:48:11,678 CLibrary.java (line
106) JNA
> >>>>>>> mlockall successful
> >>>>>>>  INFO [main] 2011-07-21 15:48:11,702 DatabaseDescriptor.java
(line
> >>>>>>> 121)
> >>>>>>> Loading settings from
> >>>>>>> file:/home/ubuntu/brisk/resources/cassandra/conf/cassandra.yaml
> >>>>>>>
> >>>>>>>
> >>>>>>> It was during this start process that the java.io.EOFException
was
> >>>>>>> seen,
> >>>>>>> but yes, like you said Jonathan, the Cassandra process started
back
> >>>>>>> up and
> >>>>>>> joined the ring.
> >>>>>>>
> >>>>>>> We're now wondering why the Repair failed and why Cassandra
crashed
> >>>>>>> in
> >>>>>>> the first place. We only had default level logging enabled.
Is
> there
> >>>>>>> something else I can check or that you suspect?
> >>>>>>>
> >>>>>>> Should we turn the logging up to debug and retry the Repair?
> >>>>>>>
> >>>>>>>
> >>>>>>> - Sameer
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Jul 21, 2011 at 12:37 PM, Jonathan Ellis <
> jbellis@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Looks harmless to me.
> >>>>>>>>
> >>>>>>>> On Thu, Jul 21, 2011 at 1:41 PM, Sameer Farooqui
> >>>>>>>> <cassandralabs@gmail.com> wrote:
> >>>>>>>>> While running Repair on a 0.8.1 node, we got this
error in the
> >>>>>>>>> system.log:
> >>>>>>>>>
> >>>>>>>>> ERROR [Thread-23] 2011-07-21 15:48:43,868
> >>>>>>>>> AbstractCassandraDaemon.java
> >>>>>>>>> (line
> >>>>>>>>> 113) Fatal exception in thread Thread[Thread-23,5,main]
> >>>>>>>>> java.io.IOError: java.io.EOFException
> >>>>>>>>> at
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:78)
> >>>>>>>>> Caused by: java.io.EOFException
> >>>>>>>>> at java.io.DataInputStream.readInt(DataInputStream.java:375)
> >>>>>>>>> at
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:66)
> >>>>>>>>>
> >>>>>>>>> There's just a bunch of informational messages about
Gossip
> before
> >>>>>>>>> this.
> >>>>>>>>>
> >>>>>>>>> Looks like the file or stream unexpectedly ended?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> http://download.oracle.com/javase/1.4.2/docs/api/java/io/EOFException.html
> >>>>>>>>>
> >>>>>>>>> Is this a bug or something wrong in our environment?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> - Sameer
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Jonathan Ellis
> >>>>>>>> Project Chair, Apache Cassandra
> >>>>>>>> co-founder of DataStax, the source for professional
Cassandra
> >>>>>>>> support
> >>>>>>>> http://www.datastax.com
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Jonathan Ellis
> >>>> Project Chair, Apache Cassandra
> >>>> co-founder of DataStax, the source for professional Cassandra support
> >>>> http://www.datastax.com
> >>>
> >>
> >>
> >>
>
>

Mime
View raw message