From: graham sanderson <graham@vast.com>
Subject: Re: Disaster recovery question
Date: Sat, 16 Nov 2013 20:56:32 -0600
To: user@cassandra.apache.org

Agreed; that was a parallel issue from our ops (I apologize and will try to avoid duplicates) - I was asking the question from the architecture side as to what should happen, rather than describing it as a bug. Nonetheless, I/we are still curious if anyone has an answer.
On Nov 16, 2013, at 6:13 PM, Mikhail Stepura wrote:

> Looks like someone has the same (1-4) questions:
> https://issues.apache.org/jira/browse/CASSANDRA-6364
>
> -M
>
> "graham sanderson" wrote in message news:7161E7E0-CF24-4B30-B9CA-2FAAFB0C4C14@vast.com...
>
> We are currently looking to deploy on the 2.0 line of Cassandra, but obviously are watching for bugs (we are currently on 2.0.2) - we are aware of a couple of interesting known bugs to be fixed in 2.0.3 and one in 2.1, but none have been observed (in production use cases) or are likely to affect our current proposed deployment.
>
> I have a few general questions:
>
> The first test we tried was to physically remove the SSD commit log drive from one of the nodes whilst under HEAVY write load (maybe a few hundred MB/s of data to be replicated 3 times - 6 node single local data center) and also while running read performance tests. We currently have both node (CQL3) and Astyanax (Thrift) clients.
>
> Frankly everything was pretty good (no read/write failures or indeed any (observed) latency issues) except, and maybe people can comment on any of these:
>
> 1) There were NO errors in the log on the node where we removed the commit log SSD drive - this surprised us (of course our ops monitoring would detect the downed disk too, but we hope to be able to look for ERROR level logging in system.log to drive alerts as well)
> 2) The node with no commit log disk just kept writing to memtables, but:
> 3) This was causing major CMS GC issues which eventually caused the node to appear down (nodetool status) to all other nodes, and indeed it itself saw all other nodes as down. That said, dynamic snitch and latency detection in clients seemed to prevent this being much of a problem, other than it seems potentially undesirable from a server-side standpoint.
> 4) nodetool gossipinfo didn't report anything abnormal for any node when run from any node.
>
> Sadly, because of an Astyanax issue (we were using the Thrift code path that does a (now unnecessary) describe cluster to check for schema disagreement before schema changes), we weren't able to create a new CF with a node marked down, and thus couldn't immediately add more data to see what would have happened: OOM or failure (we have since fixed this to go through the CQL3 code path but have not yet re-run the tests because of other application-level testing going on)... that said, maybe someone knows off the top of their head if there is a config setting that would start failing writes (due to memtable size) before GC became an issue, and we just have this misconfigured.
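Re: that last point, one thing we plan to experiment with ourselves is the memtable space cap, plus whatever CASSANDRA-6364 ends up adding for commit log failures in 2.0.3. A rough cassandra.yaml sketch of what we have in mind - the names and values below are our assumptions to check against the stock 2.0.x yaml, not a confirmed answer:

  # Total permitted memtable heap before the largest memtables get flushed
  # (defaults to a fraction of the heap when left unset). Note this forces
  # flushes rather than failing writes, so it may only delay the GC pressure.
  memtable_total_space_in_mb: 2048

  # Existing policy for failed data disks (stop / best_effort / ignore);
  # as far as we can tell it does not cover the commit log volume, which
  # would explain why we saw no ERROR logging when the SSD was pulled.
  disk_failure_policy: stop

  # Assumption: CASSANDRA-6364 adds a separate commit log failure policy in
  # 2.0.3+, so the node stops accepting writes instead of buffering forever.
  commit_failure_policy: stop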
>
> Secondly, our test was perhaps unrealistic in that when we brought the node back up, we did so with the partial commit log on the replaced disk intact (but the in-memory data lost), and we did get the following sorts of errors:
>
> At level 1, SSTableReader(path='/data/2/cassandra/searchapi_dsp_approved_feed_beta/20131113151746_20131113_140712_1384348032/searchapi_dsp_approved_feed_beta-20131113151746_20131113_140712_1384348032-jb-12-Data.db') [DecoratedKey(3508309769529441563, 2d37363730383735353837333637383432323934), DecoratedKey(9158434231083901894, 343934353436393734343637393130393335)] overlaps SSTableReader(path='/data/5/cassandra/searchapi_dsp_approved_feed_beta/20131113151746_20131113_140712_1384348032/searchapi_dsp_approved_feed_beta-20131113151746_20131113_140712_1384348032-jb-6-Data.db') [DecoratedKey(7446234284568345539, 33393230303730373632303838373837373436), DecoratedKey(9158426253052616687, 2d313430303837343831393637343030313136)]. This could be caused by a bug in Cassandra 1.1.0 .. 1.1.3 or due to the fact that you have dropped sstables from another node into the data directory. Sending back to L0. If you didn't drop in sstables, and have not yet run scrub, you should do so since you may also have rows out-of-order within an sstable
>
> 5) I guess the question is what is the best way to bring up a failed node:
> a) delete all data first?
> b) clear data but restore sstables from a previous backup to minimise subsequent data transfer?
> c) other suggestions?
>
> 6) Our experience is that taking nodes down that have problems, then deleting data (subsets if we can see partial corruption) and re-adding them is much safer (but our cluster is VERY fast). That said, can we re-sync data before re-enabling gossip, or at least before serving read requests from those nodes? (Not a huge issue, but it would mitigate consistency issues with partially recovered data in the case that multiple quorum read members were recovering.) Note that we fall back from (LOCAL_)QUORUM to (LOCAL_)ONE on UnavailableException, so we have less of a guarantee compared with both writing and reading at LOCAL_QUORUM (and note that if our LOCAL_QUORUM writes fail we will just retry when the cluster is fixed - stale data is not ideal but OK for a while).
>
> That said, given that the commit log on disk pre-dated any uncommitted lost memtable data, it seems that we shouldn't have seen exceptions, because this is kind of like 5)b) in that it should have gotten us closer to the correct state before the rest of the data was repaired, rather than causing any weirdness (unless it was a missed fsync problem), but maybe I'm being naive.
>
> Sorry for the long post; any thoughts would be appreciated.
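On 5) and 6), the recovery flow we are leaning towards trying looks roughly like this - only a sketch, and the join_ring flag in particular is an assumption we still need to verify on 2.0.x (i.e. that repair can actually run before the node starts serving reads):

  # If keeping the old sstables (5b), rewrite the out-of-order ones the
  # overlap warnings complain about:
  nodetool scrub searchapi_dsp_approved_feed_beta

  # If wiping data/commitlog/saved_caches instead (5a), restart the node
  # without taking traffic (assumption: -Dcassandra.join_ring=false keeps
  # it out of the ring until we explicitly join):
  #   cassandra -Dcassandra.join_ring=false
  nodetool repair   # re-sync from the other replicas first
  nodetool join     # only then start serving reads and writes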
>
> Thanks,
>
> Graham.