From: aaron morton <aaron@thelastpickle.com>
To: user@cassandra.apache.org
Subject: Re: Commit log periodic sync?
Date: Wed, 29 Aug 2012 20:48:15 +1200

> - if, during the streaming session, the sstable that was about to be streamed out was being compacted, would we see an error in the log?
No.
Remember, we never modify files on disk. And the "truth" contained in one file generated by compaction is the same as the "truth" contained in the files before compaction.

> - could this lead to data not found?
No.

> - is it safe to let a node serve read/write requests while repair is running?
Yes.
All maintenance operations are online operations.

> Data created before the last flush was still missing, according to the client that talked to DC1 (the disaster DC).
Losing data from before the flush sounds very strange.
That sort of thing would get noticed, especially if you can pretty much reproduce it.

I would wind the test back to a single node and see if it works as expected. If it does, walk the test scenario forward to multiple nodes and then multiple DCs until it fails. If the single-node test fails, check your test and then grab someone on IRC (I'm aaron_morton and there are plenty of other smart, helpful people there).

Sorry, it's sometimes hard to diagnose problems over email.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 28/08/2012, at 11:28 PM, rubbish me <rubbish.me@googlemail.com> wrote:

> Thanks again Aaron.
>
>> In that case I would not expect to see data loss. If you are still in a test scenario, can you try to reproduce the problem? If possible, can you reproduce it with a single node?
>
> We will try that later this week.
> ----
>
> We did the same exercise this week; this time we did a flush and snapshot before the DR actually happened, as an attempt to identify whether the commit log fsync was the problem.
>
> We can clearly see sstables were created for the flush command.
> And those sstables were loaded in when the nodes started up again after the DR exercise.
>
> At this point we believed all nodes had all the data, so we let them serve client requests while we ran repair on the nodes.
>
> Data created before the last flush was still missing, according to the client that talked to DC1 (the disaster DC).
>
> We had a look at the log of one of the DC1 nodes. The suspicious thing was that the latest sstable was being compacted during the streaming sessions of the repair. But no error was reported.
>
> Here come my questions:
> - if, during the streaming session, the sstable that was about to be streamed out was being compacted, would we see an error in the log?
> - could this lead to data not found?
> - is it safe to let a node serve read/write requests while repair is running?
>
> Many thanks again.
>
> - A
>
> On 27 Aug 2012, at 09:08, aaron morton <aaron@thelastpickle.com> wrote:
>
>>> Brutally. kill -9.
>> That's fine. I was thinking about reboot -f -n.
>>
>>> We are wondering if the fsync of the commit log was working.
>> I would say yes, only because there are no other reported problems.
>>
>> In that case I would not expect to see data loss. If you are still in a test scenario, can you try to reproduce the problem? If possible, can you reproduce it with a single node?
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 25/08/2012, at 11:00 AM, rubbish me <rubbish.me@googlemail.com> wrote:
>>
>>> Thanks, Aaron, for your reply - please see inline.
>>>
>>> On 24 Aug 2012, at 11:04, aaron morton wrote:
>>>
>>>>> - we are running on production linux VMs (not ideal but this is out of our hands)
>>>> Is the VM doing anything wacky with the IO?
>>>
>>> Could be. But I thought we would ask here first. This is a bit difficult to prove because we don't have control over these VMs.
>>>
>>>>> As part of a DR exercise, we killed all 6 nodes in DC1,
>>>> Nice disaster. Out of interest, what was the shutdown process?
>>>
>>> Brutally. kill -9.
>>>
>>>>
>>>>> We noticed that data that was written an hour before the exercise, around the last memtables being flushed, was not found in DC1.
>>>> To confirm, data was written to DC1 at CL LOCAL_QUORUM before the DR exercise.
>>>>
>>>> Was the missing data written before or after the memtable flush? I'm trying to understand whether the data should have been in the commit log or the memtables.
>>>
>>> The missing data was written after the last flush. This data was retrievable before the DR exercise.
>>>
>>>> Can you provide some more info on how you are detecting that it is not found in DC1?
>>>
>>> We tried Hector, consistency level = LOCAL_QUORUM. We had a missing column or the whole row missing.
>>>
>>> We tried cassandra-cli on DC1 nodes, same.
>>>
>>> However, once we ran the same query on DC2, C* must have then done a read repair. That particular piece of result data would appear in DC1 again.
>>>
>>>>> If we understand correctly, commit logs are written first and then synced to disk every 10s.
>>>> Writes are put into a bounded queue and processed as fast as the IO can keep up. Every 10s a sync message is added to the queue. Note that the commit log segment may rotate at any time, which requires a sync.
>>>>
>>>> A loss of data across all nodes in a DC seems odd. If you can provide some more information we may be able to help.
>>>
>>> We are wondering if the fsync of the commit log was working. But we saw no errors / warnings in the logs. Wondering if there is a way to verify....
>>>
>>>> Cheers
>>>>
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 24/08/2012, at 6:01 AM, rubbish me <rubbish.me@googlemail.com> wrote:
>>>>
>>>>> Hi all
>>>>>
>>>>> First off, let's introduce the setup.
>>>>>
>>>>> - 6 x C* 1.1.2 in the active DC (DC1), another 6 in another (DC2)
>>>>> - keyspace's RF=3 in each DC
>>>>> - Hector as client
>>>>> - client talks only to DC1 unless DC1 can't serve the request, in which case it talks only to DC2
>>>>> - commit log synced periodically with the default setting of 10s
>>>>> - consistency policy = LOCAL_QUORUM for both read and write
>>>>> - we are running on production linux VMs (not ideal but this is out of our hands)
>>>>> -----
>>>>> As part of a DR exercise, we killed all 6 nodes in DC1, Hector started talking to DC2, all the data was still there, and everything continued to work perfectly.
>>>>>
>>>>> Then we brought all the nodes in DC1 up, one by one. We saw a message saying all the commit logs were replayed. No errors reported. We didn't run repair at this time.
>>>>>
>>>>> We noticed that data that was written an hour before the exercise, around the last memtables being flushed, was not found in DC1.
>>>>>
>>>>> If we understand correctly, commit logs are written first and then synced to disk every 10s. At worst we should have lost the last 10s of data. What could be the cause of this behaviour?
>>>>>
>>>>> With the blessing of C* we could recover all this data from DC2. But we would like to understand why.
>>>>>
>>>>> Many thanks in advance.
>>>>>
>>>>> Amy
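For readers of the archive: the periodic commit log mode discussed above corresponds to commitlog_sync: periodic with commitlog_sync_period_in_ms: 10000 (the default) in cassandra.yaml. A toy sketch of the mechanism Aaron describes (writes appended through a bounded queue, with a sync marker enqueued every period) might look like the following. This is an illustration of the idea only, not Cassandra's actual implementation, and all class and method names here are made up.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Toy model of "periodic" commit log syncing: mutations are appended as fast
// as the log thread can keep up, and every syncPeriodMs a SYNC marker is
// queued. Anything appended after the last completed sync can be lost on a
// kill -9, which is why at most ~10s of writes should be at risk.
public class PeriodicCommitLogModel {

    private static final Object SYNC = new Object();   // sync marker task
    private final BlockingQueue<Object> queue = new ArrayBlockingQueue<Object>(1024); // bounded queue

    public PeriodicCommitLogModel(final long syncPeriodMs) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Enqueue a sync marker every syncPeriodMs (10,000 ms by default).
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() { queue.offer(SYNC); }
        }, syncPeriodMs, syncPeriodMs, TimeUnit.MILLISECONDS);

        Thread logWriter = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        Object task = queue.take();
                        if (task == SYNC) {
                            fsyncSegment();                 // force the segment to disk
                        } else {
                            appendToSegment((byte[]) task); // buffered append, not yet durable
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        logWriter.setDaemon(true);
        logWriter.start();
    }

    /** Called from the write path; blocks if the queue is full (back-pressure). */
    public void add(byte[] mutation) throws InterruptedException {
        queue.put(mutation);
    }

    private void appendToSegment(byte[] mutation) { /* write to the current segment's buffer */ }
    private void fsyncSegment()                   { /* e.g. FileChannel.force(true)          */ }
}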
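And a minimal sketch of the client-side setup described in the thread: Hector pointed at the local DC, with LOCAL_QUORUM as the default consistency level for both reads and writes. This assumes a Hector 1.x API; the cluster name, host list, and keyspace name are placeholders.

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public class LocalQuorumSetup {
    public static void main(String[] args) {
        // Point the client at the DC1 nodes only; failing over to DC2 is handled separately.
        Cluster cluster = HFactory.getOrCreateCluster(
                "TestCluster",
                new CassandraHostConfigurator("dc1-node1:9160,dc1-node2:9160,dc1-node3:9160"));

        // LOCAL_QUORUM for both reads and writes, as in the setup above.
        ConfigurableConsistencyLevel consistency = new ConfigurableConsistencyLevel();
        consistency.setDefaultReadConsistencyLevel(HConsistencyLevel.LOCAL_QUORUM);
        consistency.setDefaultWriteConsistencyLevel(HConsistencyLevel.LOCAL_QUORUM);

        // Mutators and queries built from this Keyspace use LOCAL_QUORUM by default.
        Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster, consistency);
    }
}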