incubator-cassandra-user mailing list archives

From <moshe.kr...@barclays.com>
Subject RE: Repair Hanging C* 1.2.4
Date Thu, 02 May 2013 13:27:43 GMT
Nothing kills a C* ring like an unstable node that won't die but drives the other nodes crazy.

Here's an idea for a new feature:
The Living Will: A configurable set of error conditions (e.g., disk full, heap space filling
up), which, if they happen, make the node voluntarily shut down.
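
To make the idea concrete, here is a minimal sketch of what such a "living will" check could look like. It is not an existing Cassandra feature. The names `Condition`, `disk_nearly_full`, `evaluate` and `enforce` are hypothetical, and the shutdown action is left as a callback the operator supplies.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Condition:
    """One clause of the living will: a name plus a probe that returns True when tripped."""
    name: str
    tripped: Callable[[], bool]


def disk_nearly_full(path: str, max_used_fraction: float) -> Callable[[], bool]:
    """Build a probe that trips when the filesystem holding `path` is too full."""
    def probe() -> bool:
        usage = shutil.disk_usage(path)
        return usage.used / usage.total >= max_used_fraction
    return probe


def evaluate(conditions: List[Condition]) -> List[str]:
    """Return the names of all tripped conditions (an empty list means healthy)."""
    return [c.name for c in conditions if c.tripped()]


def enforce(conditions: List[Condition],
            shut_down: Callable[[List[str]], None]) -> bool:
    """Run one check cycle; call `shut_down` with the reasons if anything tripped."""
    reasons = evaluate(conditions)
    if reasons:
        shut_down(reasons)
        return True
    return False
```

A heap-usage clause would need a probe that reads the JVM's memory figures (for example over JMX), which is left out here. What `shut_down` actually does (log the reasons, drain the node, stop the process) is an operational choice and is deliberately not hard-coded.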

From: Haithem Jarraya [mailto:haithem.jarraya@struq.com]
Sent: Thursday, May 02, 2013 4:13 PM
To: user@cassandra.apache.org
Subject: Re: Repair Hanging C* 1.2.4

Hi Yuki,

I think I found what went wrong: one box in WDC had its disk fill up during scrub. The box became
unusable but kept its alive status.
Wasn't it supposed to go down or be marked as dead?
By the way, can we specify the snapshot output directory in the nodetool command, so that this does
not happen again?

Thanks,

H

On 2 May 2013 13:17, Yuki Morishita <mor.yuki@gmail.com> wrote:

Hi,


ERROR [Thread-12725] 2013-05-01 14:30:54,304 StorageService.java (line 2420) Repair session failed:
java.lang.IllegalArgumentException: Requested range intersects a local range but is not fully contained in one; this would lead to imprecise repair

This error means you are repairing a range that spans multiple (virtual) nodes.
I think this won't happen unless you specify the repair range with the -st and -et options.
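
The condition behind that message can be illustrated with a simplified sketch. This is not Cassandra's actual code: it assumes integer tokens, ranges written as (left, right] with the left end exclusive, and no wraparound. The function names are made up for the illustration.

```python
from typing import List, Optional, Tuple

# (left, right]: left end exclusive, right end inclusive; wraparound is not modelled.
Range = Tuple[int, int]


def contains(outer: Range, inner: Range) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def intersects(a: Range, b: Range) -> bool:
    """True if the two half-open ranges share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def find_local_range(requested: Range, local_ranges: List[Range]) -> Optional[Range]:
    """Return the local range that fully contains `requested`.

    Raises ValueError when `requested` only partially overlaps a local range,
    which is the situation the error message describes. Returns None when the
    requested range does not touch any local range at all.
    """
    for local in local_ranges:
        if contains(local, requested):
            return local
        if intersects(local, requested):
            raise ValueError(
                "Requested range intersects a local range "
                "but is not fully contained in one"
            )
    return None
```

With local ranges (0, 100] and (100, 200], a request for (10, 50] is accepted, while a request for (50, 150] straddles the boundary at token 100 and is rejected.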

How do you start repair?

--
Yuki Morishita

On May 2, 2013, Haithem Jarraya (haithem.jarraya@struq.com) wrote:
Hi All,

Cassandra repair has been a real pain for us, and it has been holding back our migration from MongoDB
for quite some time now.
We saw errors like this during the repair:
 INFO [AntiEntropyStage:1] 2013-05-01 14:30:54,300 AntiEntropyService.java (line 764) [repair #ed104480-b26a-11e2-af9b-05179fa66b76] mycolumnfamily is fully synced (1 remaining column family to sync for this session)
ERROR [Thread-12725] 2013-05-01 14:30:54,304 StorageService.java (line 2420) Repair session failed:
java.lang.IllegalArgumentException: Requested range intersects a local range but is not fully contained in one; this would lead to imprecise repair
        at org.apache.cassandra.service.AntiEntropyService.getNeighbors(AntiEntropyService.java:175)
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.<init>(AntiEntropyService.java:621)
        at org.apache.cassandra.service.AntiEntropyService$RepairSession.<init>(AntiEntropyService.java:610)
        at org.apache.cassandra.service.AntiEntropyService.submitRepairSession(AntiEntropyService.java:127)
        at org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:2480)
        at org.apache.cassandra.service.StorageService$4.runMayThrow(StorageService.java:2416)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.lang.Thread.run(Thread.java:662)


OK, we might have gone beyond gc_grace_seconds (again, because repairs do not complete), so we ran scrub
on all nodes in parallel, as was suggested on this mailing list.
I am not sure whether this caused the problem, but we have had repairs hanging and not completing
since the day we started testing Cassandra 1.2.2, and the same issue has occurred with every upgrade:
1.2.3 and now 1.2.4.
I want a way to restart a repair if it hangs, or to cancel the previous one, without restarting
the cluster; we can't afford to do that once we go live.

Let me start by presenting our current configuration.
Data centers:
2 data centers (Amsterdam: 6 nodes with RF of 3; Washington, D.C.: RF of 1)
1 keyspace with 3 column families, ~100 GB of data.
Each node runs Cassandra 1.2.4 with Java 6 update 45 on CentOS 2.6, with 32 GB of RAM,
24 cores @ 2.00 GHz, JNA v3.2.4 installed, and 2 disks (1 rotational for the OS and commit logs, and
1 SSD for the data). We are getting really good read performance: 99% < 10 ms, 95% < 5 ms.

nodetool status
Datacenter: ams01
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns   Host ID                               Rack
UN  x.x.x.23   34.04 GB   256     13.1%  4a7bc489-25af-4c20-80f8-499ffcb18e2d  RAC1
UN  x.x.x.79    28.53 GB   256     12.6%  98a1167f-cf75-4201-a454-695e0f7d2d72  RAC1
UN  x.x.x.78    41.31 GB   256     11.9%  62a418b5-3c38-4f66-874d-8138d6d565e5  RAC1
UN  x.x.x.66   54.41 GB   256     13.8%  ab564d16-4081-4866-b8ba-26461d9a93d7  RAC1
UN  x.x.x.91    45.92 GB   256     12.6%  2e1e7179-82e6-4ae6-b986-383acc9fc8a2  RAC1
UN  x.x.x.126  37.31 GB   256     11.8%  d4bed3b1-ffaf-4c68-b560-d270355c8c4b  RAC1
Datacenter: wdc01
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns   Host ID                               Rack
UN  x.x.x.144   30.64 GB   256     12.0%  1860011e-fa7c-4ce1-ad6b-c8a38a5ddd02  RAC1
UN  x.x.x.140   86.05 GB   256     12.3%  f3fa985d-5056-4ddc-b146-d02432c3a86e  RAC1

nodetool  status mykeyspace
Datacenter: ams01
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.66      54.41 GB   256     53.6%             ab564d16-4081-4866-b8ba-26461d9a93d7  RAC1
UN  x.x.x.91      45.92 GB   256     52.1%             2e1e7179-82e6-4ae6-b986-383acc9fc8a2  RAC1
UN  x.x.x.126     37.31 GB   256     47.9%             d4bed3b1-ffaf-4c68-b560-d270355c8c4b  RAC1
UN  x.x.x.23      34.04 GB   256     50.9%             4a7bc489-25af-4c20-80f8-499ffcb18e2d  RAC1
UN  x.x.x.79      28.53 GB   256     47.4%             98a1167f-cf75-4201-a454-695e0f7d2d72  RAC1
UN  x.x.x.78      41.31 GB   256     48.0%             62a418b5-3c38-4f66-874d-8138d6d565e5  RAC1
Datacenter: wdc01
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.140     86.05 GB   256     51.5%             f3fa985d-5056-4ddc-b146-d02432c3a86e  RAC1
UN  x.x.x.144     30.64 GB   256     48.5%             1860011e-fa7c-4ce1-ad6b-c8a38a5ddd02  RAC1

The first thing we notice is that the data distribution is off by a few percent. I guess that if
we had a repair that ran without hanging, it would fix the problem.
Now, when we run a repair, it usually hangs after a day: there are no more repair messages in
the logs, no streaming, and CPU usage goes down.
This is the output of nodetool tpstats and nodetool netstats on all boxes.

Amsterdam data center:
NODE 1:
x.x.x.23 (I ran the first repair on this node). We can see AntiEntropySessions with 4 active
and 7 pending.
nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         1       33134051         0                 0
RequestResponseStage              0         0       60655547         0                 0
MutationStage                     0         0       46000521         0                 0
ReadRepairStage                   0         0        2712610         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0         848258         0                 0
AntiEntropyStage                  0         0          24530         0                 0
MigrationStage                    0         0             56         0                 0
MemtablePostFlusher               0         0           3295         0                 0
FlushWriter                       0         0           1487         0                 0
MiscStage                         0         0            245         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               4         7            891         0                 0
InternalResponseStage             0         0              7         0                 0
HintedHandoff                     0         0              7         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
BINARY                       0
READ                         0
MUTATION                     0
_TRACE                       0
REQUEST_RESPONSE             0

nodetool netstats reports nothing streaming from x.x.x.140, which is the IP of one of the nodes in
the WDC data centre.
nodetool netstats
Mode: NORMAL
Not sending any streams.
Nothing streaming from /x.x.x.140
Pool Name                    Active   Pending      Completed
Commands                        n/a         0       60710552
Responses                       n/a         0       56269681
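
Comparing tpstats output by eye across eight nodes is error-prone, so a small parser can help. The sketch below assumes the column layout shown above (pool name followed by five integer columns). The function names and the "busy but Completed did not move between two samples" rule are only a heuristic for illustration, not an official way to detect a hung repair.

```python
import re
from typing import Dict, List, Tuple

# A pool row is a name followed by five integer columns:
# Active, Pending, Completed, Blocked, All time blocked.
_POOL_ROW = re.compile(r"^(\w+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$")

PoolStats = Tuple[int, int, int]  # (active, pending, completed)


def parse_pools(tpstats_output: str) -> Dict[str, PoolStats]:
    """Extract (active, pending, completed) per pool from `nodetool tpstats` text."""
    pools: Dict[str, PoolStats] = {}
    for line in tpstats_output.splitlines():
        match = _POOL_ROW.match(line.strip())
        if match:
            name, active, pending, completed = match.group(1, 2, 3, 4)
            pools[name] = (int(active), int(pending), int(completed))
    return pools


def stalled_pools(earlier: Dict[str, PoolStats],
                  later: Dict[str, PoolStats]) -> List[str]:
    """Pools that still have active or pending work but completed nothing in between."""
    stalled = []
    for name, (active, pending, completed) in later.items():
        if name in earlier and (active or pending) and earlier[name][2] == completed:
            stalled.append(name)
    return sorted(stalled)
```

Taking two samples a few minutes apart and passing both to `stalled_pools` would flag a pool such as AntiEntropySessions if it still shows active or pending sessions while its Completed count has not changed.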

NODE 2:
x.x.x.126: nothing is happening on this node.
nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0       34896655         0                 0
RequestResponseStage              0         0       74815324         0                 0
MutationStage                     0         0       44827842         0                 0
ReadRepairStage                   0         0        3175404         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0         971226         0                 0
AntiEntropyStage                  0         0          11904         0                 0
MigrationStage                    0         0             59         0                 0
MemtablePostFlusher               0         0           7029         0                 0
FlushWriter                       0         0           3000         0                 0
MiscStage                         0         0           1300         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
InternalResponseStage             0         0              7         0                 0
HintedHandoff                     0         0              8         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
BINARY                       0
READ                         0
MUTATION                     0
_TRACE                       0
REQUEST_RESPONSE             0

Same output for netstats as the previous node
nodetool netstats
Mode: NORMAL
Not sending any streams.
Nothing streaming from /x.x.x.140
Pool Name                    Active   Pending      Completed
Commands                        n/a         0       75174907
Responses                       n/a         0       55239904

NODE 3:
x.x.x.78: after the repair stopped on NODE 1, I tried to run a repair on this node to see whether
it would change anything (I think I tried to run it twice).
We can see AntiEntropySessions with 4 active and 6 pending.
nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0       29928646         0                 0
RequestResponseStage              0         0       81431526         0                 0
MutationStage                     0         0       46631197         0                 0
ReadRepairStage                   0         0        3352193         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0         994857         0                 0
AntiEntropyStage                  0         0          32477         0                 0
MigrationStage                    0         0             44         0                 0
MemtablePostFlusher               0         0           7521         0                 0
FlushWriter                       0         0           3110         0                 0
MiscStage                         0         0           1023         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               4         6            360         0                 0
InternalResponseStage             0         0             70         0                 0
HintedHandoff                     0         0              9         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
BINARY                       0
READ                         0
MUTATION                     0
_TRACE                       0
REQUEST_RESPONSE             0


The netstats output shows that it is waiting (3 pending responses):
nodetool netstats
Mode: NORMAL
Not sending any streams.
Nothing streaming from /x.x.x.140
Pool Name                    Active   Pending      Completed
Commands                        n/a         0       81478294
Responses                       n/a         3       50728352

NODE 4:
x.x.x.66
nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0       31542526         0                 0
RequestResponseStage              0         0       66173136         0                 0
MutationStage                     0         0       46796311         0                 0
ReadRepairStage                   0         0        2542891         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0         726267         0                 0
AntiEntropyStage                  0         0           3782         0                 0
MigrationStage                    0         0             50         0                 0
MemtablePostFlusher               0         0           2807         0                 0
FlushWriter                       0         0           1400         0                 2
MiscStage                         0         0            679         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
InternalResponseStage             0         0              3         0                 0
HintedHandoff                     0         0              8         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
BINARY                       0
READ                         0
MUTATION                     0
_TRACE                       0
REQUEST_RESPONSE             0

nodetool netstats
Mode: NORMAL
Not sending any streams.
Nothing streaming from /x.x.x.140
Pool Name                    Active   Pending      Completed
Commands                        n/a         0       66201849
Responses                       n/a         0       54649566

NODE 5:
x.x.x.79
nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0        4807546         0                 0
RequestResponseStage              0         0       15208415         0                 0
MutationStage                     0         0       17640854         0                 0
ReadRepairStage                   0         0         208035         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0         241855         0                 0
AntiEntropyStage                  0         0           1096         0                 0
MigrationStage                    0         0             31         0                 0
MemtablePostFlusher               0         0            801         0                 0
FlushWriter                       0         0            351         0                 0
MiscStage                         0         0            101         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
InternalResponseStage             0         0             10         0                 0
HintedHandoff                     0         0              0         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
BINARY                       0
READ                         0
MUTATION                     0
_TRACE                       0
REQUEST_RESPONSE             0

nodetool netstats
Mode: NORMAL
Not sending any streams.
Nothing streaming from /x.x.x.140
Pool Name                    Active   Pending      Completed
Commands                        n/a         0       15216079
Responses                       n/a         2       20432998

NODE 6:
x.x.x.91
nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0       50507669         0                 0
RequestResponseStage              0         0       72430667         0                 0
MutationStage                     0         0       47096834         0                 0
ReadRepairStage                   0         0        3135286         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0         728625         0                 0
AntiEntropyStage                  0         0           3996         0                 0
MigrationStage                    0         0             57         0                 0
MemtablePostFlusher               0         0           2941         0                 0
FlushWriter                       0         0           1453         0                 2
MiscStage                         0         0            743         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
InternalResponseStage             0         0              3         0                 0
HintedHandoff                     0         0             10         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
BINARY                       0
READ                         0
MUTATION                     0
_TRACE                       0
REQUEST_RESPONSE             0

nodetool netstats
Mode: NORMAL
Not sending any streams.
Nothing streaming from /x.x.x.140
Pool Name                    Active   Pending      Completed
Commands                        n/a         0       73084789
Responses                       n/a         0       66121629


WDC data centre:
On this node we see a different tpstats: MiscStage shows 1 active and 28 pending. Why
is that?
netstats shows that none of the streams are progressing.
NODE 1:
x.x.x.140
nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0        7780233         0                 0
RequestResponseStage              0         0              0         0                 0
MutationStage                     0         0       36732109         0                 0
ReadRepairStage                   0         0              0         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0         641497         0                 0
AntiEntropyStage                  0         0           4834         0                 0
MigrationStage                    0         0             65         0                 0
MemtablePostFlusher               1         5           4400         0                 0
FlushWriter                       0         0           2771         0                 0
MiscStage                         1        28           1720         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
InternalResponseStage             0         0             15         0                 0
HintedHandoff                     0         0              8         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
BINARY                       0
READ                         0
MUTATION                     0
_TRACE                       0
REQUEST_RESPONSE             0

nodetool netstats
Mode: NORMAL
Nothing streaming to /x.x.x.91
Streaming from: /x.x.x.91
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-31657-Data.db sections=1192 progress=0/2001599 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-33861-Data.db sections=8 progress=0/36381 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-33900-Data.db sections=2 progress=0/35827 - 0%
Streaming from: /x.x.x.23
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-20361-Data.db sections=1 progress=0/35973 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-19809-Data.db sections=5809 progress=0/7701015 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-20297-Data.db sections=8 progress=0/36494 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-19444-Data.db sections=1191 progress=0/1964863 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily3/mykeyspace-mycloumnfamily3-ib-10019-Data.db sections=2338 progress=0/5357560 - 0%
Streaming from: /x.x.x.78
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily3/mykeyspace-mycloumnfamily3-ib-15329-Data.db sections=2338 progress=0/5358677 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-31112-Data.db sections=1 progress=0/36877 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-28005-Data.db sections=4026 progress=0/7804220 - 0%
Streaming from: /x.x.x.66
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily2/mykeyspace-mycloumnfamily2-ib-42913-Data.db sections=4026 progress=0/7803966 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-39649-Data.db sections=1345 progress=0/184273 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-41289-Data.db sections=1138 progress=0/1471186 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-42101-Data.db sections=133 progress=0/74800 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-42399-Data.db sections=23 progress=0/36965 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-42447-Data.db sections=3 progress=0/36404 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-41119-Data.db sections=2057 progress=0/1234797 - 0%
   mykeyspace: /disk1/cassandra/data/mykeyspace/mycloumnfamily1/mykeyspace-mycloumnfamily1-ib-40226-Data.db sections=4632 progress=0/5706080 - 0%
Pool Name                    Active   Pending      Completed
Commands                        n/a         0           2826
Responses                       n/a         4       40771253
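
The stuck streams in this netstats output can also be listed mechanically. The following is a rough sketch that assumes the 1.2-era text format shown above ("Streaming from: /peer" followed by file entries with "progress=done/total"). It tolerates the file path and its "sections=" part being on separate lines. The function name is hypothetical.

```python
import re
from typing import List, Tuple

# Matches either a "Streaming from: /<peer>" header or a file entry of the form
# "<path>.db sections=N progress=done/total"; \s+ also spans a line break.
_TOKEN = re.compile(
    r"Streaming from: /(?P<peer>\S+)"
    r"|(?P<file>/\S+\.db)\s+sections=\d+\s+progress=(?P<done>\d+)/(?P<total>\d+)"
)


def zero_progress_streams(netstats_output: str) -> List[Tuple[str, str, int]]:
    """Return (peer, file, total_bytes) for every incoming file with 0 bytes received."""
    stuck = []
    peer = None
    for match in _TOKEN.finditer(netstats_output):
        if match.group("peer"):
            peer = match.group("peer")
        elif peer is not None and int(match.group("done")) == 0:
            stuck.append((peer, match.group("file"), int(match.group("total"))))
    return stuck
```

A single sample with progress at zero does not prove a hang, since a stream may simply not have started yet. Running this on two samples taken some time apart and comparing the results gives a more reliable picture.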

NODE 2:
x.x.x.144
nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0        3412912         0                 0
RequestResponseStage              0         0              4         0                 0
MutationStage                     0         0       22540846         0                 0
ReadRepairStage                   0         0              0         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0         296181         0                 0
AntiEntropyStage                  0         0           6276         0                 0
MigrationStage                    0         0             24         0                 0
MemtablePostFlusher               0         0           5092         0                 0
FlushWriter                       0         0           2953         0                 0
MiscStage                         0         0           1811         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
InternalResponseStage             0         0              8         0                 0
HintedHandoff                     0         0              7         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR               8942
BINARY                       0
READ                         0
MUTATION                114559
_TRACE                       0
REQUEST_RESPONSE             0

nodetool netstats
Mode: NORMAL
Not sending any streams.
Not receiving any streams.
Pool Name                    Active   Pending      Completed
Commands                        n/a         0           1581
Responses                       n/a         0       25108161



Many thanks,


Haithem

