Thanks.

I can see the PHI value rose from 0.5 up to 30 while the other node was stuck at "Updating heartbeat state version" for about 9-30 seconds.
This freeze does not happen in a single-node test under twice as much load.
What's more, I monitored the JVM and found that neither CPU nor memory usage peaked during the freeze.
It looks like some network problem is just blocking everything.
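
For reference, here is a rough sketch of how an accrual failure detector turns heartbeat gaps into a PHI value. This is a simplified illustration only, not Cassandra's actual FailureDetector code; the 1000-sample window and the convict threshold of 8 are just my understanding of the 0.7 defaults.

import java.util.ArrayDeque;
import java.util.Deque;

// Simplified accrual failure detector sketch (NOT Cassandra's code).
// PHI grows roughly linearly with the time since the last heartbeat,
// scaled by the mean inter-arrival time, so a stall of tens of seconds
// pushes PHI far past a convict threshold of 8 even though nothing crashed.
class SimplePhiEstimator {
    private final Deque<Long> intervalsMs = new ArrayDeque<>();
    private long lastArrivalMs = -1;

    void reportHeartbeat(long nowMs) {
        if (lastArrivalMs >= 0) {
            intervalsMs.addLast(nowMs - lastArrivalMs);
            if (intervalsMs.size() > 1000) intervalsMs.removeFirst(); // bounded sample window
        }
        lastArrivalMs = nowMs;
    }

    double phi(long nowMs) {
        if (intervalsMs.isEmpty()) return 0.0;
        double meanMs = intervalsMs.stream().mapToLong(Long::longValue).average().orElse(1.0);
        // Exponential-arrival assumption: phi = -log10(P(gap >= t)) = t / (mean * ln 10)
        return (nowMs - lastArrivalMs) / (meanMs * Math.log(10.0));
    }

    public static void main(String[] args) {
        SimplePhiEstimator fd = new SimplePhiEstimator();
        for (long t = 0; t <= 10_000; t += 1_000) fd.reportHeartbeat(t); // steady 1s heartbeats
        System.out.printf("phi after a 30s gap: %.1f%n", fd.phi(10_000 + 30_000)); // ~13.0, well past 8
    }
}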

------------------
TRACE [GossipStage:1] 2011-05-03 14:17:21,955 Gossiper.java (line 720) Updating heartbeat state version to 1234 from 1233 for /192.168.125.49 ...
 INFO [ScheduledTasks:1] 2011-05-03 14:17:30,042 GCInspector.java (line 128) GC for ParNew: 596 ms, -35189184 reclaimed leaving 6176494968 used; max is 8466202624
------------------
TRACE [GossipStage:1] 2011-05-03 14:17:38,093 Gossiper.java (line 720) Updating heartbeat state version to 1238 from 1237 for /192.168.125.51 ...
TRACE [GossipTasks:1] 2011-05-03 14:18:08,633 Gossiper.java (line 306) Gossip Digests are : cas01/192.168.125.49:1304402198:1251 cas01/192.168.125.49:1304402198:1251 /192.168.125.51:1304402204:1238 
------------------


The detailed log is attached.


2011/4/30 aaron morton <aaron@thelastpickle.com>
Can you try turning up the logging to TRACE? If you know how, turn it up only for the org.apache.cassandra.gms.FailureDetector and org.apache.cassandra.gms.Gossiper loggers.
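
(For anyone following along: with the stock log4j setup that ships in conf/log4j-server.properties, that would look roughly like the two lines below; the exact file name and location are an assumption for your particular install.)

# Raise only the gossip-related loggers to TRACE; leave the root logger at INFO.
log4j.logger.org.apache.cassandra.gms.FailureDetector=TRACE
log4j.logger.org.apache.cassandra.gms.Gossiper=TRACE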

You'll see a lot of messages. Some of the interesting ones will be...

"FailureDetector.java (line 128) reporting /127.0.0.2"  - logged when the end point reports 
"FailureDetector.java (line 149) PHI for /127.0.0.2 : 0.9127981004059462" - logged when an end point is checked before possibly marking dead

There will also be messages from the gossiper about receiving and sending.

Aaron
 
On 29 Apr 2011, at 14:50, Sheng Chen wrote:

Thank you for your patch. 
I believe the latter version I used (the latest 0.7 branch) includes the patch, but the problem remains.

Is there anything else that may block this heartbeat, like GC?
Here are some logs during heartbeat failure.

 INFO [GossipTasks:1] 2011-04-29 07:25:09,716 Gossiper.java (line 228) InetAddress /192.168.125.51 is now dead.
 INFO [GossipStage:1] 2011-04-29 07:25:20,128 Gossiper.java (line 609) InetAddress /192.168.125.51 is now UP
----------
 INFO [FlushWriter:1] 2011-04-29 07:25:00,135 Memtable.java (line 172) Completed flushing /data/cassandra/data/Keyspace1/Standard1-f-4041-Data.db (76170930 bytes)
 INFO [COMMIT-LOG-WRITER] 2011-04-29 07:25:00,140 CommitLog.java (line 440) Discarding obsolete commit log:CommitLogSegment(/commitlog/CommitLog-1304033061025.log)
 INFO [ScheduledTasks:1] 2011-04-29 07:25:20,399 GCInspector.java (line 128) GC for ParNew: 511 ms, -71757976 reclaimed leaving 6614730672 used; max is 8466202624
 INFO [ScheduledTasks:1] 2011-04-29 07:25:34,632 GCInspector.java (line 128) GC for ParNew: 238 ms, 612197280 reclaimed leaving 4429643304 used; max is 8466202624
----------

 INFO [GossipTasks:1] 2011-04-29 07:17:48,833 Gossiper.java (line 228) InetAddress /192.168.125.51 is now dead.
 INFO [GossipStage:1] 2011-04-29 07:17:48,854 Gossiper.java (line 609) InetAddress /192.168.125.51 is now UP
----------
 INFO [GossipTasks:1] 2011-04-29 07:17:33,503 Gossiper.java (line 228) InetAddress /192.168.125.49 is now dead.
 INFO [MutationStage:3] 2011-04-29 07:17:47,751 ColumnFamilyStore.java (line 1064) Enqueuing flush of Memtable-Standard1@461979209(61555223 bytes, 1218975 operations)
 INFO [FlushWriter:1] 2011-04-29 07:17:47,751 Memtable.java (line 157) Writing Memtable-Standard1@461979209(61555223 bytes, 1218975 operations)
 INFO [GossipStage:1] 2011-04-29 07:17:49,123 Gossiper.java (line 609) InetAddress /192.168.125.49 is now UP
----------

 INFO [GossipTasks:1] 2011-04-29 07:09:46,072 Gossiper.java (line 228) InetAddress /192.168.125.51 is now dead.
 INFO [GossipStage:1] 2011-04-29 07:09:59,587 Gossiper.java (line 609) InetAddress /192.168.125.51 is now UP
----------
 INFO [FlushWriter:1] 2011-04-29 07:09:36,917 Memtable.java (line 172) Completed flushing /data/cassandra/data/Keyspace1/Standard1-f-4005-Data.db (76199863 bytes)
 INFO [COMMIT-LOG-WRITER] 2011-04-29 07:09:36,921 CommitLog.java (line 440) Discarding obsolete commit log:CommitLogSegment(/commitlog/CommitLog-1304032122510.log)
 INFO [COMMIT-LOG-WRITER] 2011-04-29 07:09:38,089 CommitLogSegment.java (line 50) Creating new commitlog segment /commitlog/CommitLog-1304032178089.log
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,858 GCInspector.java (line 128) GC for ParNew: 511 ms, -61586864 reclaimed leaving 6496071272 used; max is 8466202624
 WARN [ScheduledTasks:1] 2011-04-29 07:09:59,858 MessagingService.java (line 504) Dropped 1 REQUEST_RESPONSE messages in the last 5000ms
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,859 StatusLogger.java (line 51) Pool Name                    Active   Pending
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,859 StatusLogger.java (line 66) ReadStage                         0         0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,859 StatusLogger.java (line 66) RequestResponseStage              0         0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,860 StatusLogger.java (line 66) ReadRepairStage                   0         0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,860 StatusLogger.java (line 66) MutationStage                     0         0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,860 StatusLogger.java (line 66) GossipStage                       0         0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,860 StatusLogger.java (line 66) AntiEntropyStage                  0         0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,861 StatusLogger.java (line 66) MigrationStage                    0         0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,861 StatusLogger.java (line 66) StreamStage                       0         0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,861 StatusLogger.java (line 66) MemtablePostFlusher               0         0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,861 StatusLogger.java (line 66) FILEUTILS-DELETE-POOL             0         0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,862 StatusLogger.java (line 66) FlushWriter                       0         0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,862 StatusLogger.java (line 66) MiscStage                         0         0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,862 StatusLogger.java (line 66) FlushSorter                       0         0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 66) InternalResponseStage             0         0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 66) HintedHandoff                     1         2
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 70) CompactionManager               n/a      4067
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 82) MessagingService                n/a       0,0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 86) ColumnFamily                Memtable ops,data  Row cache size/cap  Key cache size/cap
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) Keyspace1.Super1                          0,0                 0/0            0/200000
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) Keyspace1.Standard1           410835,20744126                 0/0        36172/200000
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) system.IndexInfo                          0,0                 0/0                 0/1
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) system.LocationInfo                      1,20                 0/0                 2/2
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89) system.Migrations                         0,0                 0/0                 0/2
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89) system.HintsColumnFamily       206378,8667876                 0/0                0/26
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89) system.Schema                             0,0                 0/0                 0/2
 INFO [ScheduledTasks:1] 2011-04-29 07:10:06,872 GCInspector.java (line 128) GC for ParNew: 219 ms, 589477312 reclaimed leaving 4204479728 used; max is 8466202624
---------- 
 
 INFO [GossipTasks:1] 2011-04-29 07:03:43,262 Gossiper.java (line 228) InetAddress /192.168.125.51 is now dead.
 INFO [GossipStage:1] 2011-04-29 07:03:43,270 Gossiper.java (line 609) InetAddress /192.168.125.51 is now UP
 ----------
 INFO [FlushWriter:1] 2011-04-29 07:03:41,061 Memtable.java (line 157) Writing Memtable-Standard1@595540423(61557029 bytes, 1218975 operations)
 INFO [GossipStage:1] 2011-04-29 07:03:43,538 Gossiper.java (line 609) InetAddress /192.168.125.49 is now UP
 INFO [FlushWriter:1] 2011-04-29 07:03:46,550 Memtable.java (line 172) Completed flushing /data/cassandra/data/Keyspace1/Standard1-f-3991-Data.db (76184729 bytes)
 


2011/4/28 Jonathan Ellis <jbellis@gmail.com>
This means a node was too busy with something else to send out its
heartbeat. Sometimes this is STW GC. Other times it is a bug (one was
fixed for 0.7.6 in
https://issues.apache.org/jira/browse/CASSANDRA-2554).
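
(A practical way to confirm or rule out stop-the-world pauses is to enable GC logging. The flags below are standard HotSpot options; conf/cassandra-env.sh and the log path are assumptions about the particular install.)

# Append to JVM_OPTS in conf/cassandra-env.sh (or wherever the JVM options are set).
JVM_OPTS="$JVM_OPTS -verbose:gc"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

-XX:+PrintGCApplicationStoppedTime also records safepoint pauses that the per-collection lines can miss, which is exactly the kind of stall that would freeze the GossipStage.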

On Thu, Apr 28, 2011 at 3:57 AM, Sheng Chen <chensheng2010@gmail.com> wrote:
> Thank you for your advice. RF >= 2 is a good workaround.
> I was using 0.7.4 and have updated to the latest 0.7 branch, which includes
> 2554 patch.
> But it doesn't help. I still get lots of UnavailableException after the
> following logs,
>  INFO [GossipTasks:1] 2011-04-28 16:12:17,661 Gossiper.java (line 228)
> InetAddress /192.168.125.49 is now dead.
>  INFO [GossipStage:1] 2011-04-28 16:12:19,627 Gossiper.java (line 609)
> InetAddress /192.168.125.49 is now UP
>  INFO [HintedHandoff:1] 2011-04-28 16:13:11,452 HintedHandOffManager.java
> (line 304) Started hinted handoff for endpoint /192.168.125.49
>  INFO [HintedHandoff:1] 2011-04-28 16:13:11,453 HintedHandOffManager.java
> (line 360) Finished hinted handoff of 0 rows to endpoint /192.168.125.49
> It seems that the gossip failure detection is too sensitive. Is there any
> configuration?
>
>
>
>
>
> 2011/4/27 Sylvain Lebresne <sylvain@datastax.com>
>>
>> On Wed, Apr 27, 2011 at 10:32 AM, Sheng Chen <chensheng2010@gmail.com>
>> wrote:
>> > I succeeded in inserting 1 billion records into a single-node Cassandra,
>> >>> bin/stress -d cas01 -o insert -n 1000000000 -c 5 -S 34 -C5 -t 20
>> > Inserts finished in about 14 hours at a speed of 20k/sec.
>> > But when I added another node, tests always failed with
>> > UnavailableException
>> > in an hour.
>> >>> bin/stress -d cas01,cas02 -o insert -n 1000000000 -c 5 -S 34 -C5 -t 20
>> > Write speed is also 20k/sec because of the bottleneck in the client, so the
>> > pressure on each server node should be 50% of the single-node test.
>> > Why couldn't they handle it?
>> > By default, rf=1, consistency=ONE
>> > Some information that may be helpful,
>> > 1. no warn/error in the log files; the cluster is still alive after those
>> > exceptions
>> > 2. the last log lines on both nodes happen to be compaction-complete info
>> > 3. the gossip log shows one node is dead and then up again within 3 seconds
>>
>> That's your problem. Once cas02 is marked down (and since rf=1), when an update for
>> cas02 reaches cas01 while cas01 still has cas02 marked down, cas01 will throw the
>> UnavailableException.
>>
>> Now, it shouldn't have been marked down and I suspect this is due to
>> https://issues.apache.org/jira/browse/CASSANDRA-2554
>> (even though you didn't say which version you're using, I suppose
>> this is a 0.7.*).
>>
>> If you apply this patch or use the svn current 0.7 branch, that should
>> hopefully
>> not happen again.
>>
>> Note that if you had rf >= 2, the node would still have been marked down
>> wrongly
>> for 3 seconds, but that would have been transparent to the stress test.
>>
>> > 4. I set hinted_handoff_enabled: false, but still see lots of handoff
>> > logs
>>
>> What are those saying?
>>
>> --
>> Sylvain
>
>



--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com