From: Aaron Morton
To: Cassandra User
Reply-To: user@cassandra.apache.org
Subject: Re: DESIGN QUESTION: Need to update only older data in cassandra
Date: Thu, 21 Nov 2013 19:55:35 +1300
Message-Id: <221B8198-B9A1-4B49-B69C-4B0B61C88014@thelastpickle.com>

> The problem occurs during the day, when updates can be sent that possibly contain older data than the nightly batch update.

If you have an application-level sequence for updates (I used that term to avoid saying timestamp), you could use it as the Cassandra timestamp. As long as you know it increases, it's fine. You can specify the timestamp for a column via either Thrift or CQL3.

When the updates come in during the day, if they have the older timestamp, just send the write and it will be ignored: for each column, Cassandra keeps the value with the highest timestamp.

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 17/11/2013, at 8:45 am, Lawrence Turcotte <lawrence.turcotte@gmail.com> wrote:

> that is, data consists of an account id with a timestamp column that indicates when the account was updated. This is not to be confused with the row insertion/update timestamp maintained by Cassandra for conflict resolution within the Cassandra nodes. Furthermore, the account has about 200 columns, and updates occur nightly in batch mode, where roughly 300-400 million updates are sent. The problem occurs during the day, when updates can be sent that possibly contain older data than the nightly batch update. As such, the requirement is to first look at the account update timestamp in the database and compare it with the proposed update timestamp to determine whether to update or not.
>
> The idea here is that a read before update in Cassandra is generally not a good idea. To alleviate this problem I was thinking of either maintaining a separate Cassandra db with only two columns, account id and update timestamp, and using this as a lookup before updating, or setting up a stored procedure within the main database to do the read and update if the data within the database is older.
>
> UPDATE Account SET some columns WHERE lastUpdateTimeStamp < proposedUpdateTimeStamp.
>
> I am kind of leaning towards the separate database or keyspace as a simple lookup to determine whether to update the data in the main Cassandra database, that is, the database that contains the 200 columns of account data. If this is the best choice, then I would like to explore the pros and cons of creating a separate Cassandra node cluster for lookup of account update timestamps vs. just adding another keyspace within the main Cassandra database, in terms of performance implications. In this account-and-timestamp-only database I would also need to update the timestamp whenever the main database is updated.
>
> Any thoughts are welcome
>
> Lawrence
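As an illustration of the suggestion in the reply (use the application-level sequence as the Cassandra write timestamp and send every write without reading first), here is a minimal sketch in Python. It is not code from the thread. The table name `account`, the key column `account_id`, the column names, and the helpers `build_update` and `Row` are all hypothetical. `Row` is only a toy in-memory model of per-column "highest timestamp wins" resolution, not Cassandra itself, and it does not model how Cassandra breaks ties between equal timestamps. The only CQL3 feature relied on is the `USING TIMESTAMP` clause of `UPDATE`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Tuple


def build_update(table: str, key_col: str, columns: Iterable[str], seq: int) -> str:
    """Build a CQL3 UPDATE that uses `seq` as the write timestamp.

    Column values and the key are left as bind markers (?); only the
    application-level sequence is inlined after USING TIMESTAMP.
    """
    assignments = ", ".join(f"{name} = ?" for name in columns)
    return (
        f"UPDATE {table} USING TIMESTAMP {int(seq)} "
        f"SET {assignments} WHERE {key_col} = ?"
    )


@dataclass
class Row:
    """Toy model of one row: each column stores (write timestamp, value)."""

    cells: Dict[str, Tuple[int, Any]] = field(default_factory=dict)

    def write(self, seq: int, columns: Dict[str, Any]) -> None:
        # No read-before-write decision by the caller: every write is sent,
        # and a column only changes if the incoming timestamp is strictly newer.
        for name, value in columns.items():
            current = self.cells.get(name)
            if current is None or seq > current[0]:
                self.cells[name] = (seq, value)

    def read(self) -> Dict[str, Any]:
        # Return the surviving value of each column, without timestamps.
        return {name: value for name, (_, value) in self.cells.items()}


if __name__ == "__main__":
    row = Row()
    row.write(200, {"balance": 50, "status": "open"})  # nightly batch
    row.write(150, {"balance": 10})                    # stale daytime update
    row.write(250, {"status": "closed"})               # newer daytime update
    print(row.read())
    print(build_update("account", "account_id", ["balance", "status"], 200))
```

One caveat for this approach, stated as an assumption about the setup rather than something from the thread: every writer, the nightly batch included, would need to supply the same sequence as the timestamp. A write sent without an explicit timestamp gets one derived from the clock (microseconds since the epoch by default), which would not compare meaningfully against an application sequence.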