From: Aaron Morton
To: user@cassandra.apache.org
Subject: Re: Unbalanced ring mystery multi-DC issue with 1.1.11
Date: Wed, 2 Oct 2013 16:00:15 +1300

Check the logs for messages about nodes going up and down, and also look at the MessagingService MBean for timeouts. If the node in DC2 times out replying to DC1, the DC1 node will store a hint.
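Something along these lines will pull the timeout counters from that MBean over JMX. It is only a rough, uncompiled sketch: the bean name should be org.apache.cassandra.net:type=MessagingService, but the attribute names (TotalTimeouts, RecentTotalTimeouts, TimeoutsPerHost) are from memory, so check what jconsole shows on your 1.1.11 build first.

    import java.util.Map;

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Connects to a node's JMX port (7199 by default) and dumps the
    // MessagingService timeout counters. Attribute names are assumptions;
    // verify them in jconsole against your build.
    public class MessagingTimeouts {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName ms = new ObjectName("org.apache.cassandra.net:type=MessagingService");

                System.out.println("TotalTimeouts:       " + mbs.getAttribute(ms, "TotalTimeouts"));
                System.out.println("RecentTotalTimeouts: " + mbs.getAttribute(ms, "RecentTotalTimeouts"));

                @SuppressWarnings("unchecked")
                Map<String, Long> perHost =
                    (Map<String, Long>) mbs.getAttribute(ms, "TimeoutsPerHost");
                for (Map.Entry<String, Long> e : perHost.entrySet())
                    System.out.println(e.getKey() + " -> " + e.getValue());
            } finally {
                connector.close();
            }
        }
    }

A count that keeps climbing against the DC2 addresses while you are loading would line up with DC1 storing hints instead of getting replies in time.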

Also, when hints are stored they are TTL'd to the gc_grace_seconds for the CF (IIRC). If that's low the hints may not have been delivered.
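If you want to confirm what gc_grace_seconds is actually set to on the CFs you are loading, one way is to pull the schema over Thrift. Rough sketch only, assuming the default rpc_port of 9160 and framed transport; MyKeySpace is the keyspace name from your repair command.

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.CfDef;
    import org.apache.cassandra.thrift.KsDef;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    // Prints gc_grace_seconds for every CF in the keyspace, since that is
    // (IIRC) the ceiling on how long a stored hint stays deliverable.
    public class GcGraceCheck {
        public static void main(String[] args) throws Exception {
            TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
            transport.open();
            try {
                Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
                KsDef ks = client.describe_keyspace("MyKeySpace");
                for (CfDef cf : ks.getCf_defs())
                    System.out.println(cf.getName() + " gc_grace_seconds=" + cf.getGc_grace_seconds());
            } finally {
                transport.close();
            }
        }
    }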

I'm not aware of any specific tracking for failed hints other than log messages.
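The HintedHandoffManager MBean can at least tell you which endpoints still have hints sitting on a node. Again a sketch from memory: verify the bean name (org.apache.cassandra.db:type=HintedHandoffManager) and the listEndpointsPendingHints operation in jconsole before relying on it.

    import java.util.List;

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Lists endpoints that still have undelivered hints stored on this node.
    // MBean and operation names are assumptions; confirm them in jconsole.
    public class PendingHints {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi"));
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName hhm = new ObjectName("org.apache.cassandra.db:type=HintedHandoffManager");
                @SuppressWarnings("unchecked")
                List<String> endpoints = (List<String>) mbs.invoke(
                    hhm, "listEndpointsPendingHints", new Object[0], new String[0]);
                if (endpoints.isEmpty())
                    System.out.println("No endpoints with pending hints.");
                for (String ep : endpoints)
                    System.out.println("Pending hints for " + ep);
            } finally {
                connector.close();
            }
        }
    }

If DC2 addresses keep showing up there long after the load finished, delivery over the WAN is probably timing out or being dropped.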

A

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 28/09/2013, at 12:01 AM, Oleg Dulin <oleg.dulin@gmail.com> wrote:

> Here is some more information.
> 
> I am running full repair on one of the nodes and I am observing strange behavior.
> 
> Both DCs were up during the data load, but repair is reporting a lot of out-of-sync data. Why would that be? Is there a way for me to tell whether the WAN may be dropping hinted handoff traffic?
> 
> Regards,
> Oleg
> 
> On 2013-09-27 10:35:34 +0000, Oleg Dulin said:
> 
>> Wanted to add one more thing:
>> 
>> I can also tell that the numbers are not consistent across DCs this way -- I have a column family with really wide rows (a couple million columns). DC1 reports higher column counts than DC2. DC2 only becomes consistent after I run the command a couple of times and trigger a read repair. But why would the nodetool repair logs show that everything is in sync?
>> 
>> Regards,
>> Oleg
>> 
>> On 2013-09-27 10:23:45 +0000, Oleg Dulin said:
>> 
>>> Consider this output from nodetool ring:
>>> 
>>> Address    DC   Rack  Status  State   Load      Effective-Ownership  Token
>>>                                                                      127605887595351923798765477786913079396
>>> dc1.5      DC1  RAC1  Up      Normal  32.07 GB  50.00%               0
>>> dc2.100    DC2  RAC1  Up      Normal  8.21 GB   50.00%               100
>>> dc1.6      DC1  RAC1  Up      Normal  32.82 GB  50.00%               42535295865117307932921825928971026432
>>> dc2.101    DC2  RAC1  Up      Normal  12.41 GB  50.00%               42535295865117307932921825928971026532
>>> dc1.7      DC1  RAC1  Up      Normal  28.37 GB  50.00%               85070591730234615865843651857942052864
>>> dc2.102    DC2  RAC1  Up      Normal  12.27 GB  50.00%               85070591730234615865843651857942052964
>>> dc1.8      DC1  RAC1  Up      Normal  27.34 GB  50.00%               127605887595351923798765477786913079296
>>> dc2.103    DC2  RAC1  Up      Normal  13.46 GB  50.00%               127605887595351923798765477786913079396
>>> 
>>> I concealed the IPs and DC names for confidentiality.
>>> 
>>> All of the data loading was happening against DC1 at a pretty brisk rate of, say, 200K writes per minute.
>>> 
>>> Note how my tokens are offset by 100. Shouldn't that mean the load on each node should be roughly identical? In DC1 it is roughly 30 GB on each node. In DC2 it is almost 1/3rd of the nearest DC1 node by token range.
>>> 
>>> To verify that the nodes are in sync, I ran nodetool -h localhost repair MyKeySpace --partitioner-range on each node in DC2. Watching the logs, I see that the repair went really quickly and all column families are in sync!
>>> 
>>> I need help making sense of this. Is it because DC1 is not fully compacted? Is it because DC2 is not fully synced and I am not checking correctly? How can I tell whether replication is still in progress? (Note: I started my load yesterday at 9:50am.)


> -- 
> Regards,
> Oleg Dulin
> http://www.olegdulin.com