From: Preston Chang
To: user@cassandra.apache.org
Date: Fri, 3 Jun 2011 17:32:49 +0800
Subject: Re: sync commitlog in batch mode lose data

Thank you very much, Peter!

After I disabled the disk cache and changed the cache write mode from
write-back to write-through, I saw the result I wanted. It seems that
fsync() only synced the data to the disk cache, not to the storage
devices, while the cache was in write-back mode.

But I have another question: if I disable the disk cache but leave the
cache write mode at write-back, how does sync work? Does it still write
the data into the cache? This question may be outside the scope of the
discussion here.

Thank you all!

2011/6/3 Peter Schuller <peter.schuller@infidyne.com>

> > I disable the disk cache of RAID controller, unfortunately it still lost
> > some data.
>
> Disabling caching shouldn't be necessary; what matters is ensuring that
> all layers honor write barriers properly. A battery backed cache that
> survives a power outage need not be disabled (and usually, if you have
> battery backed caching, you don't want to disable it, since doing so
> has a considerable performance impact).
>
> To re-address your original post: Yes, given QUORUM @ RF=2 (meaning
> that QUORUM is equivalent to ALL), any *successful* write is supposed
> to be guaranteed to be visible to a subsequent read. In this case that
> holds even for reads at CL.ONE, since RF was 2 and QUORUM was
> equivalent to ALL.
>
> If this is not what you're seeing, likely causes are either (a) a
> problem with your test, (b) a Cassandra bug, or (c) a kernel/hardware
> misconfiguration or bug that causes fsync() to be broken with respect
> to power outages.
>
> In order to eliminate (a), can you share the actual test? Even if (a)
> looks good, you'd be surprised at how often (c) can be the case.
>
> If you are satisfied that the test is correct, one way to eliminate
> Cassandra as a cause of the problem may be to restart your server by
> a reset instead of cutting power, so that the power supply never
> disappears from your storage device. If you are no longer able to
> reproduce the problem, it would indicate that fsync() is at least
> causing I/O to reach a device (exit the operating system). If it still
> fails, you're none the wiser.
>
> Whether or not you're running with battery backed cache, one test you
> can do is run this (on a system which is otherwise idle):
>
>   http://distfiles.scode.org/mlref/fsynctime.py
>
> The first argument is a filename which will be created/overwritten.
> It will then start printing the number of milliseconds each fsync()
> takes. If you do not have battery backed caching, you should be seeing
> numbers in the 5-25 ms range depending on circumstances. If you see
> very low values, that indicates that fsync() is not working and the
> writes are not forced to persistent storage.
>
> (If battery backed caching exists, you will legitimately get very low
> values without it indicating anything is wrong.)
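
The measurement described above can be sketched in a few lines of
Python. This is not the linked fsynctime.py script (its contents are not
reproduced here), only a minimal illustration of the same idea under
assumptions: the function name time_fsyncs, the 512-byte payload and the
default of 10 iterations are all made up for this example.

```python
import os
import sys
import time


def time_fsyncs(path, count=10, payload=b"x" * 512):
    """Write `payload` to `path` `count` times, calling fsync() after
    each write, and return each fsync() duration in milliseconds.

    Hypothetical helper for illustration; not taken from fsynctime.py.
    """
    timings_ms = []
    # Create the file, or truncate it if it already exists.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for _ in range(count):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # ask the OS to force the write to the device
            timings_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
    return timings_ms


# Only run when a filename is given as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    for ms in time_fsyncs(sys.argv[1]):
        print(f"{ms:.2f} ms")
```

Point it at a file on the disk under test, on an otherwise idle system.
Following the reasoning above, consistently very low timings on a disk
without battery backed caching would suggest that the writes are
stopping in a volatile cache rather than reaching persistent storage.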
>
>
> --
> / Peter Schuller
>

--
by Preston Chang
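
For reference, the "QUORUM @ RF=2 is equivalent to ALL" remark in the
quoted reply follows from the usual definition of a quorum as a majority
of replicas, floor(RF / 2) + 1. A minimal sketch of that arithmetic; the
function name quorum is made up for illustration:

```python
def quorum(replication_factor):
    """Number of replicas that make up a majority (hypothetical helper)."""
    return replication_factor // 2 + 1


for rf in (1, 2, 3, 5):
    # At RF=2 the quorum size equals RF, i.e. QUORUM behaves like ALL.
    print(f"RF={rf}: quorum={quorum(rf)}, same as ALL: {quorum(rf) == rf}")
```

With RF=2 every successful QUORUM write has reached both replicas, which
is why a later read at CL.ONE is expected to see it.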
