From: Anthony John
To: user@cassandra.apache.org
Subject: Re: Best way to detect/fix bitrot today?
Date: Mon, 7 Feb 2011 15:08:26 -0600

Some RAID storage might do it, potentially more efficiently!!

Rhetorical question - Does Cassandra's architecture of reconciling reads over multiple copies of the same data provide an even more interesting answer? I submit - YES!

All bitrot protection mechanisms involve some element of redundant storage - to verify and reconstruct any rot. Cassandra can do this on JBODs with the appropriate Replication Factor (say > 3). Granted, the total storage in terms of number of disks might exceed the other alternatives, but at the lowest tier, using JBODs, the cost might actually be lower.

Food for thought, or wild imagination?

-JA

On Mon, Feb 7, 2011 at 2:09 PM, Peter Schuller <peter.schuller@infidyne.com> wrote:
> Our application space is such that there is data that might not be read for
> a long time. The data is mostly immutable. How should I approach
> detecting/solving the bitrot problem? One approach is read data and let read
> repair do the detection, but given the size of data, that does not look very
> efficient.

Note that read-repair is not really intended to repair arbitrary
corruptions. Unless I'm mistaken, when arbitrary corruption does not
trigger a serialization failure that causes row skipping, it's a
toss-up which version of the data is retained (or both, if the
corruption is in the key). Given the same key and column timestamp,
the tie breaker is the column value. So depending on whether
corruption results in a "lesser" or "greater" value, you might get the
corrupt or non-corrupt data.
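
The tie-break described above can be sketched roughly as follows. This is a minimal illustration under a simplified column model, not Cassandra's actual code: the `Column` type and the `reconcile` function are hypothetical names, and values are assumed to compare byte-wise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical, simplified stand-in for a stored column."""
    name: bytes
    value: bytes
    timestamp: int


def reconcile(a: Column, b: Column) -> Column:
    """Pick a winner between two versions of the same column.

    The higher timestamp wins. On a timestamp tie, the greater value
    (compared byte-wise, an assumption of this sketch) wins.
    """
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.value >= b.value else b


good = Column(b"col", b"hello", 100)
rot_up = Column(b"col", b"hellp", 100)    # corruption that made the value "greater"
rot_down = Column(b"col", b"helln", 100)  # corruption that made the value "lesser"

print(reconcile(good, rot_up).value)    # the corrupt copy is retained
print(reconcile(good, rot_down).value)  # the intact copy is retained
```

Nothing in this comparison knows which copy is the damaged one, which is why the outcome depends only on how the corrupt bytes happen to sort.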

> Has anybody solved/workaround this or has any other suggestions to detect
> and fix bitrot?

My feeling/tentative opinion is that the clean fix is for Cassandra to support strong checksumming at the sstable level.
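
To make the idea of per-block checksums concrete, here is a minimal sketch. It is not Cassandra's sstable format: the function names and the tiny block size are hypothetical, and CRC32 from Python's `zlib` is used only for brevity (a stronger hash from `hashlib` could be substituted).

```python
import zlib
from typing import List, Tuple

BLOCK_SIZE = 4  # tiny block size, chosen only to keep the example readable


def checksum_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[Tuple[int, bytes]]:
    """Split data into blocks and pair each block with its CRC32."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [(zlib.crc32(block), block) for block in blocks]


def find_corrupt_blocks(stored: List[Tuple[int, bytes]]) -> List[int]:
    """Return the indexes of blocks whose content no longer matches its CRC32."""
    return [i for i, (crc, block) in enumerate(stored) if zlib.crc32(block) != crc]


stored = checksum_blocks(b"immutable data")

# Simulate rot: flip one bit in the second block, leaving its stored checksum alone.
crc, block = stored[1]
stored[1] = (crc, bytes([block[0] ^ 0x01]) + block[1:])

print(find_corrupt_blocks(stored))  # prints [1]
```

A checksum like this only detects the damaged block. Repairing it would still require fetching an intact copy from somewhere else, for example another replica.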

Deploying on e.g. ZFS would help a lot with this, but that's a problem
for deployment on Linux (which is the recommended platform for
Cassandra).

--
/ Peter Schuller
