cassandra-user mailing list archives

From Anuj Wadehra <anujw_2...@yahoo.co.in>
Subject Re: New node has high network and disk usage.
Date Wed, 06 Jan 2016 17:31:43 GMT
Hi Vickrum,
I would proceed with the diagnosis as follows:
1. Analyse the sar report to check system health (CPU, memory, swap, disk, etc.). The system seems to be overloaded; this is evident from the mutation drops.
2. Make sure that all the recommended Cassandra production settings listed on the DataStax site are applied; in particular, disable zone reclaim and THP (commands sketched after this list).
3. Run a full repair on the bad node and check its data size. The node owns the largest token range but has significantly less data, so I doubt that bootstrapping completed properly.
4. Compactionstats shows 22 pending compactions. Try throttling compactions by reducing concurrent compactors or compaction throughput (see the sketch after this list).
5. Analyse the logs to make sure bootstrapping happened without errors.
6. Look for other common performance problems, such as GC pauses, to make sure the dropped mutations are not caused by them.
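
For steps 1 and 2, a rough sketch of the OS-level commands (standard Linux sysstat/sysfs/proc paths; the writes below are not persistent, so they would also need to go into your startup scripts):

    # Step 1: system health snapshot via sar (sysstat): CPU, memory, swap, disk
    sar -u -r -S -d 1 10

    # Step 2: disable THP and NUMA zone reclaim (run as root)
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
    echo 0 > /proc/sys/vm/zone_reclaim_mode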
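
For steps 3 to 6, an illustrative (not prescriptive) set of commands; the 16 MB/s throughput value and the log path are examples only, adjust to your install:

    # Step 3: full repair on the bad node
    nodetool repair

    # Step 4: throttle compaction throughput (MB/s); concurrent_compactors itself
    # is set in cassandra.yaml and needs a restart to change
    nodetool setcompactionthroughput 16

    # Steps 5 and 6: check for streaming/bootstrap errors and long GC pauses
    grep -iE "stream|bootstrap|error" /var/log/cassandra/system.log
    grep GCInspector /var/log/cassandra/system.log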

Thanks,
Anuj

On Wed, 6 Jan 2016 at 10:12 pm, Vickrum Loi <vickrum.loi@idioplatform.com> wrote:
# nodetool compactionstats
pending tasks: 22
          compaction type              keyspace                      table    completed           total   unit   progress
               Compaction  production_analytics               interactions    240410213    161172668724  bytes      0.15%
               Compaction  production_decisions  decisions.decisions_q_idx    120815385       226295183  bytes     53.39%
Active compaction remaining time :   2h39m58s

Worth mentioning that compactions haven't been running on this node particularly often. The
node's been performing badly regardless of whether it's compacting or not.

On 6 January 2016 at 16:35, Jeff Ferland <jbf@tubularlabs.com> wrote:

What’s your output of `nodetool compactionstats`?

On Jan 6, 2016, at 7:26 AM, Vickrum Loi <vickrum.loi@idioplatform.com> wrote:
Hi,

We recently added a new node to our cluster in order to replace a node that died (hardware failure, we believe). For the next two weeks it had high disk and network activity. We replaced the server, but it's happened again. We've looked into memory allowances, disk performance, number of connections, and all the nodetool stats, but can't find the cause of the issue.

`nodetool tpstats`[0] shows a lot of active and pending threads, in comparison to the rest
of the cluster, but that's likely a symptom, not a cause.

`nodetool status`[1] shows the cluster isn't quite balanced. The bad node (D) has less data.

Disk activity[2] and network activity[3] on this node are far higher than on the rest.

The only other difference between this node and the rest of the cluster is that it's on the ext4 filesystem, whereas the rest are ext3, but we've done plenty of testing there and can't see how that would affect performance on this node so much.

Nothing of note in system.log.

What should our next step be in trying to diagnose this issue?

Best wishes,
Vic

[0] `nodetool tpstats` output:

Good node:
    Pool Name                    Active   Pending      Completed   Blocked  All time blocked
    ReadStage                         0         0       46311521         0                 0
    RequestResponseStage              0         0       23817366         0                 0
    MutationStage                     0         0       47389269         0                 0
    ReadRepairStage                   0         0          11108         0                 0
    ReplicateOnWriteStage             0         0              0         0                 0
    GossipStage                       0         0        5259908         0                 0
    CacheCleanupExecutor              0         0              0         0                 0
    MigrationStage                    0         0             30         0                 0
    MemoryMeter                       0         0          16563         0                 0
    FlushWriter                       0         0          39637         0                26
    ValidationExecutor                0         0          19013         0                 0
    InternalResponseStage             0         0              9         0                 0
    AntiEntropyStage                  0         0          38026         0                 0
    MemtablePostFlusher               0         0          81740         0                 0
    MiscStage                         0         0          19196         0                 0
    PendingRangeCalculator            0         0             23         0                 0
    CompactionExecutor                0         0          61629         0                 0
    commitlog_archiver                0         0              0         0                 0
    HintedHandoff                     0         0             63         0                 0

    Message type           Dropped
    RANGE_SLICE                  0
    READ_REPAIR                  0
    PAGED_RANGE                  0
    BINARY                       0
    READ                       640
    MUTATION                     0
    _TRACE                       0
    REQUEST_RESPONSE             0
    COUNTER_MUTATION             0

Bad node:
    Pool Name                    Active   Pending      Completed   Blocked  All time blocked
    ReadStage                        32       113          52216         0                 0
    RequestResponseStage              0         0           4167         0                 0
    MutationStage                     0         0         127559         0                 0
    ReadRepairStage                   0         0            125         0                 0
    ReplicateOnWriteStage             0         0              0         0                 0
    GossipStage                       0         0           9965         0                 0
    CacheCleanupExecutor              0         0              0         0                 0
    MigrationStage                    0         0              0         0                 0
    MemoryMeter                       0         0             24         0                 0
    FlushWriter                       0         0             27         0                 1
    ValidationExecutor                0         0              0         0                 0
    InternalResponseStage             0         0              0         0                 0
    AntiEntropyStage                  0         0              0         0                 0
    MemtablePostFlusher               0         0             96         0                 0
    MiscStage                         0         0              0         0                 0
    PendingRangeCalculator            0         0             10         0                 0
    CompactionExecutor                1         1             73         0                 0
    commitlog_archiver                0         0              0         0                 0
    HintedHandoff                     0         0             15         0                 0

    Message type           Dropped
    RANGE_SLICE                130
    READ_REPAIR                  1
    PAGED_RANGE                  0
    BINARY                       0
    READ                     31032
    MUTATION                   865
    _TRACE                       0
    REQUEST_RESPONSE             7
    COUNTER_MUTATION             0


[1] `nodetool status` output:

    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address         Load       Tokens  Owns   Host ID                               Rack
    UN  A (Good)        252.37 GB  256     23.0%  9cd2e58c-a062-48a4-8d3f-b7bd9ee0576f  rack1
    UN  B (Good)        245.91 GB  256     24.4%  6f0cfff2-babe-4de2-a1e3-6201228dee44  rack1
    UN  C (Good)        254.79 GB  256     23.7%  f4891729-9179-4f19-ab2c-50d387da7ac6  rack1
    UN  D (Bad)         163.85 GB  256     28.8%  faa5b073-6af4-4c80-b280-e7fdd61924d3  rack1

[2] Disk read/write ops:

    https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/dRs4jV1ukMeFHGE/cass-disk-read-ops.png
    https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/gbE58N2WosiOomF/cass-disk-write-ops.png

[3] Network in/out:

    https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/RwOVdUBxu6fPLgF/cass-network-in.png
    https://s3-eu-west-1.amazonaws.com/uploads-eu.hipchat.com/28299/178477/OpZM6ypNVN0O30q/cass-network-out.png




  