incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From aaron morton <aa...@thelastpickle.com>
Subject Re: Heavy writes ok for single node, but failed for cluster
Date Fri, 29 Apr 2011 21:25:18 GMT
Can you try turning up the logging to TRACE, if you know how turn it up for the org.apache.cassandra.gms.FailureDetector
and org.apache.cassandra.gms.Gossiper loggers

You'll see a lot of messages. Some of the interesting ones will be...

"FailureDetector.java (line 128) reporting /127.0.0.2"  - logged when the end point reports

"FailureDetector.java (line 149) PHI for /127.0.0.2 : 0.9127981004059462" - logged when an
end point is checked before possibly marking dead

There will also be messages from the gossiper about receiving and sending.

Aaron
 
On 29 Apr 2011, at 14:50, Sheng Chen wrote:

> Thank you for your patch. 
> I believe the latter version I used (the latest 0.7 branch) includes the patch, but the
problem remains.
> 
> Is there anything else that may block this heartbeat, like gc?
> Here are some logs during heartbeat failure.
> 
>  INFO [GossipTasks:1] 2011-04-29 07:25:09,716 Gossiper.java (line 228) InetAddress /192.168.125.51
is now dead.
>  INFO [GossipStage:1] 2011-04-29 07:25:20,128 Gossiper.java (line 609) InetAddress /192.168.125.51
is now UP
> ----------
>  INFO [FlushWriter:1] 2011-04-29 07:25:00,135 Memtable.java (line 172) Completed flushing
/data/cassandra/data/Keyspace1/Standard1-f-4041-Data.db (76170930 bytes)
>  INFO [COMMIT-LOG-WRITER] 2011-04-29 07:25:00,140 CommitLog.java (line 440) Discarding
obsolete commit log:CommitLogSegment(/commitlog/CommitLog-1304033061025.log)
>  INFO [ScheduledTasks:1] 2011-04-29 07:25:20,399 GCInspector.java (line 128) GC for ParNew:
511 ms, -71757976 reclaimed leaving 6614730672 used; max is 8466202624
>  INFO [ScheduledTasks:1] 2011-04-29 07:25:34,632 GCInspector.java (line 128) GC for ParNew:
238 ms, 612197280 reclaimed leaving 4429643304 used; max is 8466202624
> ----------
> 
>  INFO [GossipTasks:1] 2011-04-29 07:17:48,833 Gossiper.java (line 228) InetAddress /192.168.125.51
is now dead.
>  INFO [GossipStage:1] 2011-04-29 07:17:48,854 Gossiper.java (line 609) InetAddress /192.168.125.51
is now UP
> ----------
>  INFO [GossipTasks:1] 2011-04-29 07:17:33,503 Gossiper.java (line 228) InetAddress /192.168.125.49
is now dead.
>  INFO [MutationStage:3] 2011-04-29 07:17:47,751 ColumnFamilyStore.java (line 1064) Enqueuing
flush of Memtable-Standard1@461979209(61555223 bytes, 1218975 operations)
>  INFO [FlushWriter:1] 2011-04-29 07:17:47,751 Memtable.java (line 157) Writing Memtable-Standard1@461979209(61555223
bytes, 1218975 operations)
>  INFO [GossipStage:1] 2011-04-29 07:17:49,123 Gossiper.java (line 609) InetAddress /192.168.125.49
is now UP
> ----------
> 
>  INFO [GossipTasks:1] 2011-04-29 07:09:46,072 Gossiper.java (line 228) InetAddress /192.168.125.51
is now dead.
>  INFO [GossipStage:1] 2011-04-29 07:09:59,587 Gossiper.java (line 609) InetAddress /192.168.125.51
is now UP
> ----------
>  INFO [FlushWriter:1] 2011-04-29 07:09:36,917 Memtable.java (line 172) Completed flushing
/data/cassandra/data/Keyspace1/Standard1-f-4005-Data.db (76199863 bytes)
>  INFO [COMMIT-LOG-WRITER] 2011-04-29 07:09:36,921 CommitLog.java (line 440) Discarding
obsolete commit log:CommitLogSegment(/commitlog/CommitLog-1304032122510.log)
>  INFO [COMMIT-LOG-WRITER] 2011-04-29 07:09:38,089 CommitLogSegment.java (line 50) Creating
new commitlog segment /commitlog/CommitLog-1304032178089.log
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,858 GCInspector.java (line 128) GC for ParNew:
511 ms, -61586864 reclaimed leaving 6496071272 used; max is 8466202624
>  WARN [ScheduledTasks:1] 2011-04-29 07:09:59,858 MessagingService.java (line 504) Dropped
1 REQUEST_RESPONSE messages in the last 5000ms
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,859 StatusLogger.java (line 51) Pool Name
                   Active   Pending
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,859 StatusLogger.java (line 66) ReadStage
                        0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,859 StatusLogger.java (line 66) RequestResponseStage
             0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,860 StatusLogger.java (line 66) ReadRepairStage
                  0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,860 StatusLogger.java (line 66) MutationStage
                    0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,860 StatusLogger.java (line 66) GossipStage
                      0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,860 StatusLogger.java (line 66) AntiEntropyStage
                 0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,861 StatusLogger.java (line 66) MigrationStage
                   0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,861 StatusLogger.java (line 66) StreamStage
                      0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,861 StatusLogger.java (line 66) MemtablePostFlusher
              0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,861 StatusLogger.java (line 66) FILEUTILS-DELETE-POOL
            0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,862 StatusLogger.java (line 66) FlushWriter
                      0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,862 StatusLogger.java (line 66) MiscStage
                        0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,862 StatusLogger.java (line 66) FlushSorter
                      0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 66) InternalResponseStage
            0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 66) HintedHandoff
                    1         2
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 70) CompactionManager
              n/a      4067
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 82) MessagingService
               n/a       0,0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 86) ColumnFamily
               Memtable ops,data  Row cache size/cap  Key cache size/cap
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) Keyspace1.Super1
                         0,0                 0/0            0/200000
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) Keyspace1.Standard1
          410835,20744126                 0/0        36172/200000
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) system.IndexInfo
                         0,0                 0/0                 0/1
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) system.LocationInfo
                     1,20                 0/0                 2/2
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89) system.Migrations
                        0,0                 0/0                 0/2
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89) system.HintsColumnFamily
      206378,8667876                 0/0                0/26
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89) system.Schema
                            0,0                 0/0                 0/2
>  INFO [ScheduledTasks:1] 2011-04-29 07:10:06,872 GCInspector.java (line 128) GC for ParNew:
219 ms, 589477312 reclaimed leaving 4204479728 used; max is 8466202624
> ---------- 
>  
>  INFO [GossipTasks:1] 2011-04-29 07:03:43,262 Gossiper.java (line 228) InetAddress /192.168.125.51
is now dead.
>  INFO [GossipStage:1] 2011-04-29 07:03:43,270 Gossiper.java (line 609) InetAddress /192.168.125.51
is now UP
>  ----------
>  INFO [FlushWriter:1] 2011-04-29 07:03:41,061 Memtable.java (line 157) Writing Memtable-Standard1@595540423(61557029
bytes, 1218975 operations)
>  INFO [GossipStage:1] 2011-04-29 07:03:43,538 Gossiper.java (line 609) InetAddress /192.168.125.49
is now UP
>  INFO [FlushWriter:1] 2011-04-29 07:03:46,550 Memtable.java (line 172) Completed flushing
/data/cassandra/data/Keyspace1/Standard1-f-3991-Data.db (76184729 bytes)
>  
> 
> 
> 2011/4/28 Jonathan Ellis <jbellis@gmail.com>
> This means a node was too busy with something else to send out its
> heartbeat. Sometimes this is STW GC. Other times it is a bug (one was
> fixed for 0.7.6 in
> https://issues.apache.org/jira/browse/CASSANDRA-2554).
> 
> On Thu, Apr 28, 2011 at 3:57 AM, Sheng Chen <chensheng2010@gmail.com> wrote:
> > Thank you for your advice. Rf>=2 is a good work around.
> > I was using 0.7.4 and have updated to the latest 0.7 branch, which includes
> > 2554 patch.
> > But it doesn't help. I still get lots of UnavailableException after the
> > following logs,
> >  INFO [GossipTasks:1] 2011-04-28 16:12:17,661 Gossiper.java (line 228)
> > InetAddress /192.168.125.49 is now dead.
> >  INFO [GossipStage:1] 2011-04-28 16:12:19,627 Gossiper.java (line 609)
> > InetAddress /192.168.125.49 is now UP
> >  INFO [HintedHandoff:1] 2011-04-28 16:13:11,452 HintedHandOffManager.java
> > (line 304) Started hinted handoff for endpoint /192.168.125.49
> >  INFO [HintedHandoff:1] 2011-04-28 16:13:11,453 HintedHandOffManager.java
> > (line 360) Finished hinted handoff of 0 rows to endpoint /192.168.125.49
> > It seems that the gossip failure detection is too sensitive. Is there any
> > configuration?
> >
> >
> >
> >
> >
> > 2011/4/27 Sylvain Lebresne <sylvain@datastax.com>
> >>
> >> On Wed, Apr 27, 2011 at 10:32 AM, Sheng Chen <chensheng2010@gmail.com>
> >> wrote:
> >> > I succeeded to insert 1 billion records into a single node cassandra,
> >> >>> bin/stress -d cas01 -o insert -n 1000000000 -c 5 -S 34 -C5 -t 20
> >> > Inserts finished in about 14 hours at a speed of 20k/sec.
> >> > But when I added another node, tests always failed with
> >> > UnavailableException
> >> > in an hour.
> >> >>> bin/stress -d cas01,cas02 -o insert -n 1000000000 -c 5 -S 34 -C5
-t 20
> >> > Writes speed is also 20k/sec because of the bottleneck in the client, so
> >> > the
> >> > pressure on each server node should be 50% of the single node test.
> >> > Why couldn't they handle?
> >> > By default, rf=1, consistency=ONE
> >> > Some information that may be helpful,
> >> > 1. no warn/error in log file, the cluster is still alive after those
> >> > exception
> >> > 2. the last logs on both nodes happen to be a compaction complete info
> >> > 3. gossip log shows one node is dead and then up again in 3 seconds
> >>
> >> That's your problem. Once marked down (and since rf=1), when an update for
> >> cas02 reach cas01 and cas01 has marked cas02 down, it will throw the
> >> UnavailableException.
> >>
> >> Now, it shouldn't have been marked down and I suspect this is due to
> >> https://issues.apache.org/jira/browse/CASSANDRA-2554
> >> (even though you didn't tell which version you're using, I suppose
> >> this is a 0.7.*).
> >>
> >> If you apply this patch or use the svn current 0.7 branch, that should
> >> hopefully
> >> not happen again.
> >>
> >> Note that if you had rf >= 2, the node would still have been marked down
> >> wrongly
> >> for 3 seconds, but that would have been transparent to the stress test.
> >>
> >> > 4. I set hinted_handoff_enabled: false, but still see lots of handoff
> >> > logs
> >>
> >> What are those saying ?
> >>
> >> --
> >> Sylvain
> >
> >
> 
> 
> 
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
> 


Mime
View raw message