incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Oleg Proudnikov <ol...@cloudorange.com>
Subject Cassandra Thrift calls block indefinitely
Date Mon, 07 Feb 2011 04:26:04 GMT
Hi All,

This is the first time I see this. I am using Hector for a bulk load into a 3
node Cassandra 0.7.0 cluster. I have been doing this for a while now but this
time the load was more intense compared to the ones before and it was running
from a single client machine because I was afraid to overload the cluster.

My client blocked on a batch_insert for a very long time. Actually the call
never returned. This happened during bulk load at night. In fact the client got
auto-killed and restarted 20 times and every time it was blocked. After seeing
this in the morning many hours later I ran new Java stress test and saw this:

./stress -o INSERT -e ONE -c 30 -n 100000 -l 1 -i 1 -t 1
Created keyspaces. Sleeping 1s for propagation.
total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
0,0,0,NaN,1
0,0,0,NaN,2
0,0,0,NaN,3
0,0,0,NaN,4
0,0,0,NaN,5
0,0,0,NaN,6

Then I used cli to insert a single column into CF Standard1 created by the test
above. After this stress test was able to run. Here is what the log had in the
morning - the time is European. This is likely related to the stress test. I am
including this for completeness.

 INFO [MigrationStage:1] 2011-02-05 14:29:52,689 ColumnFamilyStore.java (line
648) switching in a fresh Memtable for Migrations at
CommitLogContext(file='/data/social2/
commitlog/CommitLog-1296861890465.log', position=6798985)
 INFO [MigrationStage:1] 2011-02-05 14:29:52,689 ColumnFamilyStore.java (line
952) Enqueuing flush of Memtable-Migrations@24152491(9724 bytes, 1 operations)
 INFO [MigrationStage:1] 2011-02-05 14:29:52,689 ColumnFamilyStore.java (line
648) switching in a fresh Memtable for Schema at
CommitLogContext(file='/data/social2/comm
itlog/CommitLog-1296861890465.log', position=6798985)
 INFO [FlushWriter:1] 2011-02-05 14:29:52,689 Memtable.java (line 155) Writing
Memtable-Migrations@24152491(9724 bytes, 1 operations)
INFO [MigrationStage:1] 2011-02-05 14:29:52,689 ColumnFamilyStore.java (line
952) Enqueuing flush of Memtable-Schema@22965127(5481 bytes, 5 operations)
INFO [FlushWriter:1] 2011-02-05 14:29:52,955 Memtable.java (line 162) Completed
flushing /data/social2/data/system/Migrations-e-15-Data.db (9857 bytes)
INFO [FlushWriter:1] 2011-02-05 14:29:52,955 Memtable.java (line 155) Writing
Memtable-Schema@22965127(5481 bytes, 5 operations)
ERROR [MutationStage:12] 2011-02-05 14:29:53,014 RowMutationVerbHandler.java
(line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find 
cfId=1009

================

Here is a summary of some unusual activity on one of the nodes before the first
blockage. This node is different from the other nodes - it is slower according
to my stress tests due to non-Cassandra background and disk activity:

- node had an OOM 10 minutes prior to the first blockage, 
- node restarted normally
- replayed commit logs
- lost contact with 2 other nodes for a few milliseconds
- flushed one memtable with 640K operations
- filled the same memtable with another 640K columns in 9 seconds ??? impossible
 ??? Maybe I am misreading the log below?

INFO [MutationStage:4] 2011-02-05 00:18:14,437 ColumnFamilyStore.java (line 952)
Enqueuing flush of Memtable-Domains@29765464(14150991 bytes, 640000 operations)
INFO [MutationStage:4] 2011-02-05 00:18:23,743 ColumnFamilyStore.java (line 952)
Enqueuing flush of Memtable-Domains@1573293(13948680 bytes, 640000 operations)

- compacted 4 SSTables for this CF

- compacted 5 SSTables for the same CF - very unusual

- reported a large row - this CF does have a large row but it should not have
been touched in this load.

- while the above is happening CMS is reclaiming 1.2G several times

- node lost contact with another node that seems under some pressure with high
heap and no GC to speak of

- the CF in question now receives only 10K columns in 4 minutes

- the last activity is hinted hand off to the stressed node that reappeared.

After the blockage started the only activity I see is all CFs being flushed due
to time and all caches being saved. No other activity until the stress test in
the morning.

Thank you,
Oleg




Mime
View raw message