From: graham sanderson <graham@vast.com>
Subject: Re: Disaster recovery question
Date: Sat, 16 Nov 2013 20:56:32 -0600
To: user@cassandra.apache.org

Agreed; that was a parallel issue from our ops (I apologize and will try to avoid duplicates) - I was asking the question from the architecture side as to what should happen, rather than describing it as a bug. Nonetheless, I/we are still curious if anyone has an answer.
On Nov 16, 2013, at 6:13 PM, Mikhail Stepura wrote:

> Looks like someone has the same (1-4) questions:
> https://issues.apache.org/jira/browse/CASSANDRA-6364
>
> -M
>
> "graham sanderson" wrote in message news:7161E7E0-CF24-4B30-B9CA-2FAAFB0C4C14@vast.com...
>
> We are currently looking to deploy on the 2.0 line of Cassandra, but obviously are watching for bugs (we are currently on 2.0.2) - we are aware of a couple of interesting known bugs to be fixed in 2.0.3 and one in 2.1, but none have been observed (in production use cases) or are likely to affect our current proposed deployment.
>
> I have a few general questions:
>
> The first test we tried was to physically remove the SSD commit log drive from one of the nodes whilst under HEAVY write load (maybe a few hundred MB/s of data to be replicated 3 times - 6 node single local data center) and also while running read performance tests. We currently have both node (CQL3) and Astyanax (Thrift) clients.
>
> Frankly everything was pretty good (no read/write failures or indeed any (observed) latency issues) except, and maybe people can comment on any of these:
>
> 1) There were NO errors in the log on the node where we removed the commit log SSD drive - this surprised us (of course our ops monitoring would detect the downed disk too, but we hope to be able to look for ERROR level logging in system.log to drive alerts as well)
> 2) The node with no commit log disk just kept writing to memtables, but:
> 3) This was causing major CMS GC issues which eventually caused the node to appear down (nodetool status) to all other nodes, and indeed it itself saw all other nodes as down. That said, dynamic snitch and latency detection in clients seemed to prevent this being much of a problem, other than it seems potentially undesirable from a server-side standpoint.
> 4) nodetool gossipinfo didn't report anything abnormal for any node when run from any node.
>
> Sadly, because of an Astyanax issue (we were using the Thrift code path that does a (now unnecessary) describe cluster to check for schema disagreement before schema changes), we weren't able to create a new CF with a node marked down, and thus couldn't immediately add more data to see what would have happened: OOM or failure (we have since fixed this to go through the CQL3 code path but have not yet re-run the tests because of other application-level testing going on)... that said, maybe someone knows off the top of their head if there is a config setting that would start failing writes (due to memtable size) before GC became an issue, and we just have this misconfigured.
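Re: that last point, one thing we plan to experiment with ourselves is the memtable space cap, plus whatever CASSANDRA-6364 ends up adding for commit log failures in 2.0.3. A rough cassandra.yaml sketch of what we have in mind - the names and values below are our assumptions to check against the stock 2.0.x yaml, not a confirmed answer:

  # Total permitted memtable heap before the largest memtables get flushed
  # (defaults to a fraction of the heap when left unset). Note this forces
  # flushes rather than failing writes, so it may only delay the GC pressure.
  memtable_total_space_in_mb: 2048

  # Existing policy for failed data disks (stop / best_effort / ignore);
  # as far as we can tell it does not cover the commit log volume, which
  # would explain why we saw no ERROR logging when the SSD was pulled.
  disk_failure_policy: stop

  # Assumption: CASSANDRA-6364 adds a separate commit log failure policy in
  # 2.0.3+, so the node stops accepting writes instead of buffering forever.
  commit_failure_policy: stop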
>
> Secondly, our test was perhaps unrealistic in that when we brought the node back up, we did so with the partial commit log on the replaced disk intact (but the in-memory data lost), and we did get the following sorts of errors:
>
> At level 1, SSTableReader(path='/data/2/cassandra/searchapi_dsp_approved_feed_beta/20131113151746_20131113_140712_1384348032/searchapi_dsp_approved_feed_beta-20131113151746_20131113_140712_1384348032-jb-12-Data.db') [DecoratedKey(3508309769529441563, 2d37363730383735353837333637383432323934), DecoratedKey(9158434231083901894, 343934353436393734343637393130393335)] overlaps SSTableReader(path='/data/5/cassandra/searchapi_dsp_approved_feed_beta/20131113151746_20131113_140712_1384348032/searchapi_dsp_approved_feed_beta-20131113151746_20131113_140712_1384348032-jb-6-Data.db') [DecoratedKey(7446234284568345539, 33393230303730373632303838373837373436), DecoratedKey(9158426253052616687, 2d313430303837343831393637343030313136)]. This could be caused by a bug in Cassandra 1.1.0 .. 1.1.3 or due to the fact that you have dropped sstables from another node into the data directory. Sending back to L0. If you didn't drop in sstables, and have not yet run scrub, you should do so since you may also have rows out-of-order within an sstable
>
> 5) I guess the question is what is the best way to bring up a failed node:
> a) delete all data first?
> b) clear data but restore sstables from a previous backup to minimise subsequent data transfer?
> c) other suggestions?
>
> 6) Our experience is that taking nodes down that have problems, then deleting data (subsets if we can see partial corruption) and re-adding them is much safer (but our cluster is VERY fast). That said, can we re-sync data before re-enabling gossip, or at least before serving read requests from those nodes? (Not a huge issue, but it would mitigate consistency issues with partially recovered data in the case that multiple quorum read members were recovering.) Note that we fall back from (LOCAL_)QUORUM to (LOCAL_)ONE on UnavailableException, so we have less of a guarantee compared with both writing and reading at LOCAL_QUORUM (and note that if our LOCAL_QUORUM writes fail we will just retry when the cluster is fixed - stale data is not ideal but OK for a while).
>
> That said, given that the commit log on disk pre-dated any uncommitted lost memtable data, it seems that we shouldn't have seen exceptions, because this is kind of like 5)b) in that it should have gotten us closer to the correct state before the rest of the data was repaired, rather than causing any weirdness (unless it was a missed fsync problem), but maybe I'm being naive.
>
> Sorry for the long post; any thoughts would be appreciated.
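On 5) and 6), the recovery flow we are leaning towards trying looks roughly like this - only a sketch, and the join_ring flag in particular is an assumption we still need to verify on 2.0.x (i.e. that repair can actually run before the node starts serving reads):

  # If keeping the old sstables (5b), rewrite the out-of-order ones the
  # overlap warnings complain about:
  nodetool scrub searchapi_dsp_approved_feed_beta

  # If wiping data/commitlog/saved_caches instead (5a), restart the node
  # without taking traffic (assumption: -Dcassandra.join_ring=false keeps
  # it out of the ring until we explicitly join):
  #   cassandra -Dcassandra.join_ring=false
  nodetool repair   # re-sync from the other replicas first
  nodetool join     # only then start serving reads and writes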
>
> Thanks,
>
> Graham.