FYI https://issues.apache.org/jira/browse/CASSANDRA-5483 :)


yuki

On May 2, 2013, Wei Zhu (wz1975@yahoo.com) wrote:

It would be helpful to have something like "nodetool repairstatus" which reports the status of running repair sessions and recently completed repair sessions.
Right now, I have to grep the log and check compactionstats, tpstats and netstats to find out what's going on with the repair.
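
Until something like that exists, the tpstats part of the check can be scripted. This is only a minimal sketch: it assumes the column layout of the `nodetool tpstats` output pasted further down in this thread (pool name, Active, Pending, ...), and the function name `pool_counts` is made up for illustration.

```python
def pool_counts(tpstats_text, pool_name):
    """Return (active, pending) for one pool in `nodetool tpstats` output.

    Returns None if the pool is not present in the text.
    """
    for line in tpstats_text.splitlines():
        fields = line.split()
        # Data rows look like: <pool> <active> <pending> <completed> ...
        if len(fields) >= 3 and fields[0] == pool_name:
            return int(fields[1]), int(fields[2])
    return None


# Three lines copied from the tpstats output of the first Amsterdam node.
sample = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
AntiEntropyStage                  0         0          24530         0                 0
AntiEntropySessions               4         7            891         0                 0
"""

print(pool_counts(sample, "AntiEntropySessions"))
```

A non-zero Active or Pending count for AntiEntropySessions that never changes between runs is a hint that a repair session is stuck, but it does not say why.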

-Wei


From: "Haithem Jarraya" <haithem.jarraya@struq.com>
To: user@cassandra.apache.org
Sent: Thursday, May 2, 2013 6:39:37 AM
Subject: Re: Repair Hanging C* 1.2.4

+1 to Moshe.
A cluster ending up in a strange state because just one node failed defeats the USPs of C*, such as fault tolerance and data replication between DCs.



On 2 May 2013 14:27, <moshe.kranc@barclays.com> wrote:

Nothing kills a C* ring like an unstable node that won’t die but drives the other nodes crazy.

 

Here’s an idea for a new feature:

The Living Will: A configurable set of error conditions (e.g., disk full, heap space filling up), which, if they happen, make the node voluntarily shut down.
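
To make the idea concrete, here is a rough sketch of the "disk full" condition only. It is not an existing Cassandra feature: the function names and the 5% threshold are assumptions for illustration, and what the node does once the condition fires is left to the operator.

```python
import shutil


def disk_nearly_full(total_bytes, free_bytes, min_free_fraction):
    """True when free space has dropped below the configured fraction."""
    return free_bytes < total_bytes * min_free_fraction


def living_will_triggered(data_dir, min_free_fraction=0.05):
    """Check one hypothetical 'living will' condition for a data directory."""
    usage = shutil.disk_usage(data_dir)  # named tuple: total, used, free
    return disk_nearly_full(usage.total, usage.free, min_free_fraction)


if __name__ == "__main__":
    # "." stands in for the real data directory.
    if living_will_triggered("."):
        print("disk nearly full: this node should take itself out of the ring")
    else:
        print("disk space ok")
```

A real version would run this periodically and would need similar checks for the other conditions (heap usage, for example).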

 

From: Haithem Jarraya [mailto:haithem.jarraya@struq.com]
Sent: Thursday, May 02, 2013 4:13 PM
To: user@cassandra.apache.org
Subject: Re: Repair Hanging C* 1.2.4

 

Hi Yuki,

 

I think I found what went wrong: one box in WDC had its disk fill up during scrub. The box became unusable but kept its alive status.

Wasn't it supposed to go down or be marked as dead?

By the way, can we specify the snapshot output directory in the nodetool command, so that this does not happen again?

 

Thanks,

 

H

 

On 2 May 2013 13:17, Yuki Morishita <mor.yuki@gmail.com> wrote:

Hi,

 

ERROR [Thread-12725] 2013-05-01 14:30:54,304 StorageService.java (line 2420) Repair session failed:

java.lang.IllegalArgumentException: Requested range intersects a local range but is not fully contained in one; this would lead to imprecise repair

 

This error means you are repairing a range that spans multiple (virtual) nodes.

I think this won't happen unless you specify the repair range with the -st and -et options.
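
As a simplified illustration of the condition behind that message (this is not the actual Cassandra code): ranges are modelled here as (start, end] token pairs with start < end, wraparound is ignored, and the function names are made up.

```python
def contained_in_one(requested, local_ranges):
    """True if the requested range fits entirely inside a single local range."""
    req_start, req_end = requested
    return any(start <= req_start and req_end <= end
               for start, end in local_ranges)


def intersects_any(requested, local_ranges):
    """True if the requested range overlaps at least one local range."""
    req_start, req_end = requested
    return any(req_start < end and start < req_end
               for start, end in local_ranges)


def check_repair_range(requested, local_ranges):
    if contained_in_one(requested, local_ranges):
        return "ok"
    if intersects_any(requested, local_ranges):
        return "intersects a local range but is not fully contained in one"
    # Illustrative only: no claim about what Cassandra does in this case.
    return "no overlap with local ranges"


# Two adjacent (virtual) node ranges with made-up tokens.
local = [(0, 100), (100, 200)]
print(check_repair_range((10, 50), local))
print(check_repair_range((50, 150), local))
```

The second call straddles the boundary at token 100, which is the situation the exception describes.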

 

How do you start repair?

 

--
Yuki Morishita

On May 2, 2013, Haithem Jarraya (haithem.jarraya@struq.com) wrote:

Hi All,

 

Cassandra repair has been a real pain for us, and it has been holding back our migration from Mongo for quite some time now.

We saw errors like this during the repair:

 INFO [AntiEntropyStage:1] 2013-05-01 14:30:54,300 AntiEntropyService.java (line 764) [repair #ed104480-b26a-11e2-af9b-05179fa66b76] mycolumnfamily is fully synced (1 remaining column family to sync for this session)

ERROR [Thread-12725] 2013-05-01 14:30:54,304 StorageService.java (line 2420) Repair session failed:

java.lang.IllegalArgumentException: Requested range intersects a local range but is not fully contained in one; this would lead to imprecise repair

        at org.apache.cassandra.service.AntiEntropyService.getNeighbors(AntiEntropyService.java:175)

        at org.apache.cassandra.service.AntiEntropyService$RepairSession.<init>(AntiEntropyService.java:621)

        at org.apache.cassandra.service.AntiEntropyService$RepairSession.<init>(AntiEntropyService.java:610)

        at org.apache.cassandra.service.AntiEntropyService.submitRepairSession(AntiEntropyService.java:127)

        at org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:2480)

        at org.apache.cassandra.service.StorageService$4.runMayThrow(StorageService.java:2416)

        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)

        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)

        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

        at java.util.concurrent.FutureTask.run(FutureTask.java:138)

        at java.lang.Thread.run(Thread.java:662)

 

 

OK, we might have gone beyond GCGraceSeconds (again, because repairs do not complete), so we ran scrub on all nodes in parallel, as was suggested on this mailing list.

I am not sure whether this could be the cause of the problem, but we have had this issue of repair hanging and not completing since the day we started testing Cassandra 1.2.2, and the same issue has come back with every upgrade: 1.2.3 and now 1.2.4.

I want a way to restart a repair if it hangs, or to cancel the previous one, without restarting the cluster; we can’t afford to do that once we go live.

 

Let me start by presenting our current configuration.

Data Centers:

2 data centers (Amsterdam: 6 nodes with RF of 3; Washington D.C.: RF of 1)

1 keyspace with 3 column families ~= 100GB of data.

Each node runs Cassandra 1.2.4 with Java6_update45 on CentOS 2.6, with 32GB of RAM, 24 cores @ 2.00GHz, JNA v3.2.4 installed, and 2 disks (1 rotational for the OS and commit logs, and 1 SSD for the data). We are getting really good read performance: 99% < 10ms, 95% < 5ms.

 

nodetool status

Datacenter: ams01

=================

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

--  Address       Load       Tokens  Owns   Host ID                               Rack

UN  x.x.x.23   34.04 GB   256     13.1%  4a7bc489-25af-4c20-80f8-499ffcb18e2d  RAC1

UN  x.x.x.79    28.53 GB   256     12.6%  98a1167f-cf75-4201-a454-695e0f7d2d72  RAC1

UN  x.x.x.78    41.31 GB   256     11.9%  62a418b5-3c38-4f66-874d-8138d6d565e5  RAC1

UN  x.x.x.66   54.41 GB   256     13.8%  ab564d16-4081-4866-b8ba-26461d9a93d7  RAC1

UN  x.x.x.91    45.92 GB   256     12.6%  2e1e7179-82e6-4ae6-b986-383acc9fc8a2  RAC1

UN  x.x.x.126  37.31 GB   256     11.8%  d4bed3b1-ffaf-4c68-b560-d270355c8c4b  RAC1

Datacenter: wdc01

=================

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

--  Address       Load       Tokens  Owns   Host ID                               Rack

UN  x.x.x.144   30.64 GB   256     12.0%  1860011e-fa7c-4ce1-ad6b-c8a38a5ddd02  RAC1

UN  x.x.x.140   86.05 GB   256     12.3%  f3fa985d-5056-4ddc-b146-d02432c3a86e  RAC1

 

nodetool  status mykeyspace

Datacenter: ams01

=================

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack

UN  x.x.x.66   54.41 GB   256     53.6%             ab564d16-4081-4866-b8ba-26461d9a93d7  RAC1

UN  x.x.x.91    45.92 GB   256     52.1%             2e1e7179-82e6-4ae6-b986-383acc9fc8a2  RAC1

UN  x.x.x.126  37.31 GB   256     47.9%             d4bed3b1-ffaf-4c68-b560-d270355c8c4b  RAC1

UN  x.x.x.23   34.04 GB   256     50.9%             4a7bc489-25af-4c20-80f8-499ffcb18e2d  RAC1

UN  x.x.x.79    28.53 GB   256     47.4%             98a1167f-cf75-4201-a454-695e0f7d2d72  RAC1

UN  x.x.x.78    41.31 GB   256     48.0%             62a418b5-3c38-4f66-874d-8138d6d565e5  RAC1

Datacenter: wdc01

=================

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack

UN  x.x.x.140   86.05 GB   256     51.5%             f3fa985d-5056-4ddc-b146-d02432c3a86e  RAC1

UN  x.x.x.144   30.64 GB   256     48.5%             1860011e-fa7c-4ce1-ad6b-c8a38a5ddd02  RAC1

 

The first thing we notice is that the data distribution is off by a few percent. I guess that if we had a repair that ran without hanging, this would fix the problem.

Now, when we run a repair, it usually hangs after a day: there are no more repair or streaming messages in the logs, and CPU usage goes down.

This is the output of nodetool tpstats and nodetool netstats on all boxes.

 

Amsterdam data center:

NODE 1:

x.x.x.23 (I ran the first repair on this node). We can see AntiEntropySessions with 4 active and 7 pending.

nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                         0         1       33134051         0                 0

RequestResponseStage              0         0       60655547         0                 0

MutationStage                     0         0       46000521         0                 0

ReadRepairStage                   0         0        2712610         0                 0

ReplicateOnWriteStage             0         0              0         0                 0

GossipStage                       0         0         848258         0                 0

AntiEntropyStage                  0         0          24530         0                 0

MigrationStage                    0         0             56         0                 0

MemtablePostFlusher               0         0           3295         0                 0

FlushWriter                       0         0           1487         0                 0

MiscStage                         0         0            245         0                 0

commitlog_archiver                0         0              0         0                 0

AntiEntropySessions               4         7            891         0                 0

InternalResponseStage             0         0              7         0                 0

HintedHandoff                     0         0              7         0                 0

 

Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR                  0

BINARY                       0

READ                         0

MUTATION                     0

_TRACE                       0

REQUEST_RESPONSE             0

 

nodetool netstats says nothing is streaming from x.x.x.140, which is the IP of one of the nodes in the WDC datacentre.

nodetool netstats

Mode: NORMAL

Not sending any streams.

Nothing streaming from /x.x.x.140

Pool Name                    Active   Pending      Completed

Commands                        n/a         0       60710552

Responses                       n/a         0       56269681

 

NODE 2:

x.x.x.126: nothing is going on on this node.

nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                         0         0       34896655         0                 0

RequestResponseStage              0         0       74815324         0                 0

MutationStage                     0         0       44827842         0                 0

ReadRepairStage                   0         0        3175404         0                 0

ReplicateOnWriteStage             0         0              0         0                 0

GossipStage                       0         0         971226         0                 0

AntiEntropyStage                  0         0          11904         0                 0

MigrationStage                    0         0             59         0                 0

MemtablePostFlusher               0         0           7029         0                 0

FlushWriter                       0         0           3000         0                 0

MiscStage                         0         0           1300         0                 0

commitlog_archiver                0         0              0         0                 0

AntiEntropySessions               0         0              0         0                 0

InternalResponseStage             0         0              7         0                 0

HintedHandoff                     0         0              8         0                 0

 

Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR                  0

BINARY                       0

READ                         0

MUTATION                     0

_TRACE                       0

REQUEST_RESPONSE             0

 

Same output for netstats as the previous node

nodetool netstats

Mode: NORMAL

Not sending any streams.

Nothing streaming from /x.x.x.140

Pool Name                    Active   Pending      Completed

Commands                        n/a         0       75174907

Responses                       n/a         0       55239904

 

NODE 3:

x.x.x.78: after the repair stopped on NODE 1, I tried to run a repair on this node to see whether it would change things (I think I tried to run it twice).

We can see AntiEntropySessions with 4 active and 6 pending.

nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                         0         0       29928646         0                 0

RequestResponseStage              0         0       81431526         0                 0

MutationStage                     0         0       46631197         0                 0

ReadRepairStage                   0         0        3352193         0                 0

ReplicateOnWriteStage             0         0              0         0                 0

GossipStage                       0         0         994857         0                 0

AntiEntropyStage                  0         0          32477         0                 0

MigrationStage                    0         0             44         0                 0

MemtablePostFlusher               0         0           7521         0                 0

FlushWriter                       0         0           3110         0                 0

MiscStage                         0         0           1023         0                 0

commitlog_archiver                0         0              0         0                 0

AntiEntropySessions               4         6            360         0                 0

InternalResponseStage             0         0             70         0                 0

HintedHandoff                     0         0              9         0                 0

 

Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR                  0

BINARY                       0

READ                         0

MUTATION                     0

_TRACE                       0

REQUEST_RESPONSE             0

 

 

We can see from the netstats output that it is waiting:

nodetool netstats

Mode: NORMAL

Not sending any streams.

Nothing streaming from /x.x.x.140

Pool Name                    Active   Pending      Completed

Commands                        n/a         0       81478294

Responses                       n/a         3       50728352

 

NODE 4:

x.x.x.66

nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                         0         0       31542526         0                 0

RequestResponseStage              0         0       66173136         0                 0

MutationStage                     0         0       46796311         0                 0

ReadRepairStage                   0         0        2542891         0                 0

ReplicateOnWriteStage             0         0              0         0                 0

GossipStage                       0         0         726267         0                 0

AntiEntropyStage                  0         0           3782         0                 0

MigrationStage                    0         0             50         0                 0

MemtablePostFlusher               0         0           2807         0                 0

FlushWriter                       0         0           1400         0                 2

MiscStage                         0         0            679         0                 0

commitlog_archiver                0         0              0         0                 0

AntiEntropySessions               0         0              0         0                 0

InternalResponseStage             0         0              3         0                 0

HintedHandoff                     0         0              8         0                 0

 

Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR                  0

BINARY                       0

READ                         0

MUTATION                     0

_TRACE                       0

REQUEST_RESPONSE             0

 

nodetool netstats

Mode: NORMAL

Not sending any streams.

Nothing streaming from /x.x.x.140

Pool Name                    Active   Pending      Completed

Commands                        n/a         0       66201849

Responses                       n/a         0       54649566

 

NODE 5:

x.x.x.79

nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                         0         0        4807546         0                 0

RequestResponseStage              0         0       15208415         0                 0

MutationStage                     0         0       17640854         0                 0

ReadRepairStage                   0         0         208035         0                 0

ReplicateOnWriteStage             0         0              0         0                 0

GossipStage                       0         0         241855         0                 0

AntiEntropyStage                  0         0           1096         0                 0

MigrationStage                    0         0             31         0                 0

MemtablePostFlusher               0         0            801         0                 0

FlushWriter                       0         0            351         0                 0

MiscStage                         0         0            101         0                 0

commitlog_archiver                0         0              0         0                 0

AntiEntropySessions               0         0              0         0                 0

InternalResponseStage             0         0             10         0                 0

HintedHandoff                     0         0              0         0                 0

 

Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR                  0

BINARY                       0

READ                         0

MUTATION                     0

_TRACE                       0

REQUEST_RESPONSE             0

 

nodetool netstats

Mode: NORMAL

Not sending any streams.

Nothing streaming from /x.x.x.140

Pool Name                    Active   Pending      Completed

Commands                        n/a         0       15216079

Responses                       n/a         2       20432998

 

NODE 6:

x.x.x.91

nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                         0         0       50507669         0                 0

RequestResponseStage              0         0       72430667         0                 0

MutationStage                     0         0       47096834         0                 0

ReadRepairStage                   0         0        3135286         0                 0

ReplicateOnWriteStage             0         0              0         0                 0

GossipStage                       0         0         728625         0                 0

AntiEntropyStage                  0         0           3996         0                 0

MigrationStage                    0         0             57         0                 0

MemtablePostFlusher               0         0           2941         0                 0

FlushWriter                       0         0           1453         0                 2

MiscStage                         0         0            743         0                 0

commitlog_archiver                0         0              0         0                 0

AntiEntropySessions               0         0              0         0                 0

InternalResponseStage             0         0              3         0                 0

HintedHandoff                     0         0             10         0                 0

 

Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR                  0

BINARY                       0

READ                         0

MUTATION                     0

_TRACE                       0

REQUEST_RESPONSE             0

 

nodetool netstats

Mode: NORMAL

Not sending any streams.

Nothing streaming from /x.x.x.140

Pool Name                    Active   Pending      Completed

Commands                        n/a         0       73084789

Responses                       n/a         0       66121629

 

 

WDC datacentre:

On this node we see different tpstats output: MiscStage shows 1 active and 28 pending. Why is that?

The netstats output shows that none of the streams are progressing.

NODE 1:

x.x.x.140

nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                         0         0        7780233         0                 0

RequestResponseStage              0         0              0         0                 0

MutationStage                     0         0       36732109         0                 0

ReadRepairStage                   0         0              0         0                 0

ReplicateOnWriteStage             0         0              0         0                 0

GossipStage                       0         0         641497         0                 0

AntiEntropyStage                  0         0           4834         0                 0

MigrationStage                    0         0             65         0                 0

MemtablePostFlusher               1         5           4400         0                 0

FlushWriter                       0         0           2771         0                 0

MiscStage                         1        28           1720         0                 0

commitlog_archiver                0         0              0         0                 0

AntiEntropySessions               0         0              0         0                 0

InternalResponseStage             0         0             15         0                 0

HintedHandoff                     0         0              8         0                 0

 

Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR                  0

BINARY                       0

READ                         0

MUTATION                     0

_TRACE                       0

REQUEST_RESPONSE             0

 

nodetool netstats

Mode: NORMAL

Nothing streaming to /x.x.x.91

Streaming from: /x.x.x.91

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-31657-Data.db sections=1192 progress=0/2001599 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-33861-Data.db sections=8 progress=0/36381 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-33900-Data.db sections=2 progress=0/35827 - 0%

Streaming from: /x.x.x.23

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-20361-Data.db sections=1 progress=0/35973 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-19809-Data.db sections=5809 progress=0/7701015 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-20297-Data.db sections=8 progress=0/36494 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-19444-Data.db sections=1191 progress=0/1964863 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily3/mykeyspace-mycloumnfamily3-ib-10019-Data.db sections=2338 progress=0/5357560 - 0%

Streaming from: /x.x.x.78

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily3/mykeyspace-mycloumnfamily3-ib-15329-Data.db sections=2338 progress=0/5358677 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-31112-Data.db sections=1 progress=0/36877 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-28005-Data.db sections=4026 progress=0/7804220 - 0%

Streaming from: /x.x.x.66

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-42913-Data.db sections=4026 progress=0/7803966 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-39649-Data.db sections=1345 progress=0/184273 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-41289-Data.db sections=1138 progress=0/1471186 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-42101-Data.db sections=133 progress=0/74800 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-42399-Data.db sections=23 progress=0/36965 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-42447-Data.db sections=3 progress=0/36404 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-41119-Data.db sections=2057 progress=0/1234797 - 0%

   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-40226-Data.db sections=4632 progress=0/5706080 - 0%

Pool Name                    Active   Pending      Completed

Commands                        n/a         0           2826

Responses                       n/a         4       40771253
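
A single netstats snapshot with progress=0 only shows that a stream has not started; to tell that it is stuck, two snapshots taken some time apart have to be compared. A minimal sketch of that comparison follows. It assumes the "progress=<done>/<total>" line format shown above; the function names and the short file paths in the example are made up.

```python
import re

# Matches lines such as:
#    mykeyspace: /path/to/file-Data.db sections=8 progress=0/36381 - 0%
PROGRESS = re.compile(
    r"^[ \t]+\S+: (\S+) sections=\d+ progress=(\d+)/\d+", re.MULTILINE
)


def progress_by_file(netstats_text):
    """Map each streamed file path to the number of bytes received so far."""
    return {m.group(1): int(m.group(2)) for m in PROGRESS.finditer(netstats_text)}


def stalled_files(before_text, after_text):
    """Files present in both snapshots whose progress did not change."""
    before = progress_by_file(before_text)
    after = progress_by_file(after_text)
    return sorted(f for f in after if f in before and after[f] == before[f])


# Two hypothetical snapshots taken a few minutes apart.
snapshot_1 = (
    "Streaming from: /x.x.x.91\n"
    "   ks: /data/a-Data.db sections=2 progress=0/100 - 0%\n"
    "   ks: /data/b-Data.db sections=1 progress=10/50 - 20%\n"
)
snapshot_2 = (
    "Streaming from: /x.x.x.91\n"
    "   ks: /data/a-Data.db sections=2 progress=0/100 - 0%\n"
    "   ks: /data/b-Data.db sections=1 progress=30/50 - 60%\n"
)

print(stalled_files(snapshot_1, snapshot_2))
```

Here only the first file is reported, because the second one moved from 10 to 30 bytes between the two snapshots.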

 

NODE 2:

x.x.x.140

nodetool tpstats

Pool Name                    Active   Pending      Completed   Blocked  All time blocked

ReadStage                         0         0        3412912         0                 0

RequestResponseStage              0         0              4         0                 0

MutationStage                     0         0       22540846         0                 0

ReadRepairStage                   0         0              0         0                 0

ReplicateOnWriteStage             0         0              0         0                 0

GossipStage                       0         0         296181         0                 0

AntiEntropyStage                  0         0           6276         0                 0

MigrationStage                    0         0             24         0                 0

MemtablePostFlusher               0         0           5092         0                 0

FlushWriter                       0         0           2953         0                 0

MiscStage                         0         0           1811         0                 0

commitlog_archiver                0         0              0         0                 0

AntiEntropySessions               0         0              0         0                 0

InternalResponseStage             0         0              8         0                 0

HintedHandoff                     0         0              7         0                 0

 

Message type           Dropped

RANGE_SLICE                  0

READ_REPAIR               8942

BINARY                       0

READ                         0

MUTATION                114559

_TRACE                       0

REQUEST_RESPONSE             0

 

nodetool netstats

Mode: NORMAL

Not sending any streams.

Not receiving any streams.

Pool Name                    Active   Pending      Completed

Commands                        n/a         0           1581

Responses                       n/a         0       25108161

 

 

 

Many thanks,

 

 

Haithem

 
