cassandra-user mailing list archives

From: Jeff Jirsa <jeff.ji...@crowdstrike.com>
Subject: Re: nodetool drain running for days
Date: Wed, 06 Apr 2016 17:12:47 GMT
Drain should not run for days – if it were me, I’d be:

- Checking for ‘DRAINED’ in the server logs
- Running ‘nodetool flush’ just to explicitly flush the commitlog/memtables (generally useful before doing drain, too, since drain can be somewhat race-y)
- Explicitly killing Cassandra following the flush – drain should simply be a flush + shutdown of everything, so it should take on the order of seconds, not days.
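
A minimal sketch of that sequence, assuming a package install with the default log location (/var/log/cassandra/system.log) and an init/service-managed Cassandra – adjust the path and service name for your environment:

# Has the node already logged that the drain finished?
grep DRAINED /var/log/cassandra/system.log

# Flush memtables explicitly before draining
nodetool flush

# Retry the drain – once everything is flushed it should return in seconds
nodetool drain

# If it still hangs, stop the process directly (service name may differ)
sudo service cassandra stop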
For your question about 3.0: historically, Cassandra has had some bugs in new major versions:

- Hints were broken from 1.0.0 to 1.0.3 - https://issues.apache.org/jira/browse/CASSANDRA-3466
- Hints were broken again from 1.1.0 to 1.1.6 - https://issues.apache.org/jira/browse/CASSANDRA-4772
- There was a corruption bug in 2.0 until 2.0.8 - https://issues.apache.org/jira/browse/CASSANDRA-6285
- There were a number of rough edges in 2.1, including a memory leak fixed in 2.1.7 - https://issues.apache.org/jira/browse/CASSANDRA-9549
- Compaction kept stopping in 2.2.0 until 2.2.2 - https://issues.apache.org/jira/browse/CASSANDRA-10270

Because of this history of “bugs in new versions”, many operators choose to hold off on going to new versions until they’re “better tested”. The catch-22 is obvious here: if nobody uses it, nobody tests it in the real world to find the bugs not discovered in automated testing. The DataStax folks did some awesome work for 3.0 to extend the unit and distributed tests – they’re MUCH better than they were in 2.2, so hopefully there are fewer surprise bugs in 3+, but there are bound to be a few. The Apache team has also changed the release cycle to release more frequently, so that there’s less new code in each release (see http://www.planetcassandra.org/blog/cassandra-2-2-3-0-and-beyond/). If you’ve got a lab/demo/stage/test environment that can tolerate some outages, I definitely encourage you to upgrade there first. If a few surprise issues will cost your company millions of dollars, or will cost you your job, let someone else upgrade and be the guinea pig, and don’t upgrade until you’re compelled to do so by a bug fix you need, or a feature that won’t be in the version you’re running.



From:  Paco Trujillo
Reply-To:  "user@cassandra.apache.org"
Date:  Tuesday, April 5, 2016 at 11:12 PM
To:  "user@cassandra.apache.org"
Subject:  nodetool drain running for days

We are having performance problems with our cluster: timeouts when repairs are running or during massive deletes. One piece of advice I received was to upgrade our Cassandra version from 2.0.17 to 2.2. I am draining one of the nodes to start the upgrade, and the drain has now been running for two days. In the logs I only see entries like these from time to time:

 

INFO [ScheduledTasks:1] 2016-04-06 08:17:10,987 ColumnFamilyStore.java (line 808) Enqueuing flush of Memtable-sstable_activity@1382334976(15653/226669 serialized/live bytes, 6023 ops)
INFO [FlushWriter:1468] 2016-04-06 08:17:10,988 Memtable.java (line 362) Writing Memtable-sstable_activity@1382334976(15653/226669 serialized/live bytes, 6023 ops)
INFO [ScheduledTasks:1] 2016-04-06 08:17:11,004 ColumnFamilyStore.java (line 808) Enqueuing flush of Memtable-compaction_history@1425848386(1599/15990 serialized/live bytes, 51 ops)
INFO [FlushWriter:1468] 2016-04-06 08:17:11,012 Memtable.java (line 402) Completed flushing /var/lib/cassandra/data/system/sstable_activity/system-sstable_activity-jb-4826-Data.db (6348 bytes) for commitlog position ReplayPosition(segmentId=1458540068021, position=1198022)
INFO [FlushWriter:1468] 2016-04-06 08:17:11,012 Memtable.java (line 362) Writing Memtable-compaction_history@1425848386(1599/15990 serialized/live bytes, 51 ops)
INFO [FlushWriter:1468] 2016-04-06 08:17:11,039 Memtable.java (line 402) Completed flushing /var/lib/cassandra/data/system/compaction_history/system-compaction_history-jb-3491-Data.db (730 bytes) for commitlog position ReplayPosition(segmentId=1458540068021, position=1202850)

 

Should I wait or just stop the node and start the migration?

 

Another question: I have checked the changes in 3.0 and I do not see any incompatibilities with the features we are using at the moment or with our current hardware (apart from the Java version). Probably more people have asked this, but is there some important reason not to upgrade the cluster?
