Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 8FEA218029 for ; Wed, 13 May 2015 12:13:04 +0000 (UTC) Received: (qmail 84760 invoked by uid 500); 13 May 2015 12:13:01 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 84721 invoked by uid 500); 13 May 2015 12:13:01 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 84711 invoked by uid 99); 13 May 2015 12:13:01 -0000 Received: from Unknown (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 13 May 2015 12:13:01 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id EBE251A2AFC for ; Wed, 13 May 2015 12:13:00 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 4.349 X-Spam-Level: **** X-Spam-Status: No, score=4.349 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_ENVFROM_END_DIGIT=0.25, HTML_MESSAGE=3, KAM_LINEPADDING=1.2, RCVD_IN_MSPIKE_H2=-0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd2-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-us-east.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id qAX_yOQO_oJD for ; Wed, 13 May 2015 12:12:53 +0000 (UTC) Received: from mail-qg0-f46.google.com (mail-qg0-f46.google.com [209.85.192.46]) by mx1-us-east.apache.org (ASF Mail Server at mx1-us-east.apache.org) with ESMTPS id 04A5D40E1C for ; Wed, 13 May 2015 12:12:53 +0000 (UTC) Received: by qgdy78 with SMTP id y78so19978442qgd.0 for ; Wed, 13 May 2015 05:12:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=WcXO5Hbr7TDez+LOGrJcLF+WFpq2L6vZAAU5m8ow7MQ=; b=zr+/g0WkYS2R5d1SXjqupPgqCAlEpCyhviJ05ASwknIXoBWSgv5yasd63/DzQjABkG V9pXxMaW7QFNHl/QCB15vnGgf17M9Hdj9L68+WoNQfjcGy3nsDmdUf4Z08CRU8628uWU FExBMGPaLALP1N2NG5gvBD82DwidMGTuf7nAdiASQokffUmWo4rHTWot5p3bm+jchx9H 3dSLkPIA5oG80VppkSYZWk1wIyQG21SoICyrKte9cjL0fzBfa0WJFzdaOHxVuxqwkHVu lPKaWuK63OV5jbzTC2rH6CjByRWjNVhoDBVJvzmRxK8xJnM5FOtaSrPBIuD/LumFR5sq Bv8A== MIME-Version: 1.0 X-Received: by 10.140.32.34 with SMTP id g31mr25453182qgg.74.1431519172596; Wed, 13 May 2015 05:12:52 -0700 (PDT) Received: by 10.140.29.225 with HTTP; Wed, 13 May 2015 05:12:52 -0700 (PDT) In-Reply-To: <889B05DA4CA1424799D672CFCF9170EFA7730DDC@MX102CL01.corp.emc.com> References: <889B05DA4CA1424799D672CFCF9170EFA7730561@MX102CL01.corp.emc.com> <889B05DA4CA1424799D672CFCF9170EFA7730D34@MX102CL01.corp.emc.com> <889B05DA4CA1424799D672CFCF9170EFA7730DDC@MX102CL01.corp.emc.com> Date: Wed, 13 May 2015 17:12:52 +0500 Message-ID: Subject: Re: Updating only modified records (where lastModified < current date) From: Ali Akhtar To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=001a113a6334c5b8ac0515f58726 --001a113a6334c5b8ac0515f58726 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable > I don=E2=80=99t understand the ETL use case and its relevance here. Can y= ou provide more details? Basically, every 1 hour a job runs which queries an external API and gets some records. Then, I want to take only new or updated records, and insert / update them in cassandra. For records that are already in cassandra and aren't modified, I want to ignore them. Each record returns a lastModified datetime, I want to use that to determine whether a record was changed or not (if it was, it'd be updated, if not, it'd be ignored). The issue was, I'm having to do a 'select lastModified from table where id =3D ?' query for every record, in order to determine if db lastModified < a= pi lastModified or not. I was wondering if there was a way to avoid that. If I use 'USING TIMESTAMP', would subsequent updates where lastModified is a value that was previously used, still create that overhead, or will they be ignored? E.g if I issued an update where TIMESTAMP is X, then 1 hour later I issued another update where TIMESTAMP is still X, will that 2nd update essentially get ignored, or will it cause any overhead? On Wed, May 13, 2015 at 5:02 PM, Peer, Oded wrote: > USING TIMESTAMP doesn=E2=80=99t avoid compaction overhead. > > When you modify data the value is stored along with a timestamp indicatin= g > the timestamp of the value. > > Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then > you store V=E2=80=99 with timestamp T1. > > Now you have two values of V in the DB: , > > When you read the value of V from the DB you read both , , > Cassandra resolves the conflict by comparing the timestamp and returns V. > > Compaction will later take care and remove from the DB. > > > > I don=E2=80=99t understand the ETL use case and its relevance here. Can y= ou > provide more details? > > > > UPDATE in Cassandra updates specific rows. All of them are updated, > nothing is ignored. > > > > > > *From:* Ali Akhtar [mailto:ali.rac200@gmail.com] > *Sent:* Wednesday, May 13, 2015 2:43 PM > > *To:* user@cassandra.apache.org > *Subject:* Re: Updating only modified records (where lastModified < > current date) > > > > Its rare for an existing record to have changes, but the etl job runs > every hour, therefore it will send updates each time, regardless of wheth= er > there were changes or not. > > > > (I'm assuming that USING TIMESTAMP here will avoid the compaction > overhead, since that will cause it to not run any updates unless the > timestamp is actually > last update timestamp?) > > > > Also, is there a way to get the number of rows which were updated / > ignored? > > > > On Wed, May 13, 2015 at 4:37 PM, Peer, Oded wrote: > > The cost of issuing an UPDATE that won=E2=80=99t update anything is compa= ction > overhead. Since you stated it=E2=80=99s rare for rows to be updated then = the > overhead should be negligible. > > > > The easiest way to convert a milliseconds timestamp long value to > microseconds is to multiply by 1000. > > > > *From:* Ali Akhtar [mailto:ali.rac200@gmail.com] > *Sent:* Wednesday, May 13, 2015 2:15 PM > *To:* user@cassandra.apache.org > *Subject:* Re: Updating only modified records (where lastModified < > current date) > > > > Would TimeUnit.MILLISECONDS.toMicros( myDate.getTime() ) work for > producing the microsecond timestamp ? > > > > On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar wrote: > > If specifying 'using' timestamp, the docs say to provide microseconds, bu= t > where are these microseconds obtained from? I have regular java.util.Date > objects, I can get the time in milliseconds (i.e the unix timestamp), how > would I convert that to microseconds? > > > > On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar wrote: > > Thanks Peter, that's interesting. I didn't know of that option. > > > > If updates don't create tombstones (and i'm already taking pains to ensur= e > no nulls are present in queries), then is there no cost to just submittin= g > an update for everything regardless of whether lastModified has changed? > > > > Thanks. > > > > On Wed, May 13, 2015 at 3:38 PM, Peer, Oded wrote: > > You can use the =E2=80=9Clast modified=E2=80=9D value as the TIMESTAMP fo= r your UPDATE > operation. > > This way the values will only be updated if lastModified date > the > lastModified you have in the DB. > > > > Updates to values don=E2=80=99t create tombstones. Only deletes (either b= y > executing delete, inserting a null value or by setting a TTL) create > tombstones. > > > > > > *From:* Ali Akhtar [mailto:ali.rac200@gmail.com] > *Sent:* Wednesday, May 13, 2015 1:27 PM > *To:* user@cassandra.apache.org > *Subject:* Updating only modified records (where lastModified < current > date) > > > > I'm running some ETL jobs, where the pattern is the following: > > > > 1- Get some records from an external API, > > > > 2- For each record, see if its lastModified date > the lastModified i hav= e > in db (or if I don't have that record in db) > > > > 3- If lastModified < dbLastModified, the item wasn't changed, ignore it. > Otherwise, run an update query and update that record. > > > > (It is rare for existing records to get updated, so I'm not that concerne= d > about tombstones). > > > > The problem however is, since I have to query each record's lastModified, > one at a time, that's adding a major bottleneck to my job. > > > > E.g if I have 6k records, I have to run a total of 6k 'select lastModifie= d > from myTable where id =3D ?' queries. > > > > Is there a better way, am I doing anything wrong, etc? Any suggestions > would be appreciated. > > > > Thanks. > > > > > > > > > --001a113a6334c5b8ac0515f58726 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
> I don=E2=80=99t understand the ETL use case and its r= elevance here. Can you provide more details?

Basical= ly, every 1 hour a job runs which queries an external API and gets some rec= ords. Then, I want to take only new or updated records, and insert / update= them in cassandra. For records that are already in cassandra and aren'= t modified, I want to ignore them.

Each record ret= urns a lastModified datetime, I want to use that to determine whether a rec= ord was changed or not (if it was, it'd be updated, if not, it'd be= ignored).

The issue was, I'm having to do a &= #39;select lastModified from table where id =3D ?' query for every reco= rd, in order to determine if db lastModified < api lastModified or not. = I was wondering if there was a way to avoid that.

= If I use 'USING TIMESTAMP', would subsequent updates where lastModi= fied is a value that was previously used, still create that overhead, or wi= ll they be ignored?

E.g if I issued an update wher= e TIMESTAMP is X, then 1 hour later I issued another update where TIMESTAMP= is still X, will that 2nd update essentially get ignored, or will it cause= any overhead?

On Wed, May 13, 2015 at 5:02 PM, Peer, Oded <= Oded.Peer@rsa.com> wrote:

USING TIMESTAMP doesn=E2= =80=99t avoid compaction overhead.

When you modify data the = value is stored along with a timestamp indicating the timestamp of the valu= e.

Assume timestamp T1 < = T2 and you stored value V with timestamp T2. Then you store V=E2=80=99 with= timestamp T1.

Now you have two values o= f V in the DB: <V,T2>, <V=E2=80=99,T1>

When you read the value o= f V from the DB you read both <V,T2>, <V=E2=80=99,T1>, Cassandr= a resolves the conflict by comparing the timestamp and returns V.=

Compaction will later tak= e care and remove <V=E2=80=99,T1> from the DB.

=C2=A0

I don=E2=80=99t understan= d the ETL use case and its relevance here. Can you provide more details?=

=C2=A0

UPDATE in Cassandra updat= es specific rows. All of them are updated, nothing is ignored.

=C2=A0

=C2=A0

From: Ali Akht= ar [mailto:ali.ra= c200@gmail.com]
Sent: Wednesday, May 13, 2015 2:43 PM


To: u= ser@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified <= current date)

=C2=A0

Its rare for an existing record to have changes, but= the etl job runs every hour, therefore it will send updates each time, reg= ardless of whether there were changes or not.

=C2=A0

(I'm assuming that USING TIMESTAMP here will avo= id the compaction overhead, since that will cause it to not run any updates= unless the timestamp is actually > last update timestamp?)

=C2=A0

Also, is there a way to get the number of rows which= were updated / ignored?

=C2=A0

On Wed, May 13, 2015 at 4:37 PM, Peer, Oded <Oded.Peer@rsa.com>= ; wrote:

The cost of issuing an UP= DATE that won=E2=80=99t update anything is compaction overhead. Since you s= tated it=E2=80=99s rare for rows to be updated then the overhead should be negli= gible.

=C2=A0

The easiest way to conver= t a milliseconds timestamp long value to microseconds is to multiply by 1000.

=C2=A0

From: Ali Akht= ar [mailto:ali.ra= c200@gmail.com]
Sent: Wednesday, May 13, 2015 2:15 PM
To: u= ser@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified <= current date)

=C2=A0

Would TimeUnit.MILLISECONDS.toMicros( =C2=A0myDate.g= etTime() ) work for producing the microsecond timestamp ?

=C2=A0

On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar <ali.rac200@gmail.com<= /a>> wrote:

If specifying 'u= sing' timestamp, the docs say to provide microseconds, but where are th= ese microseconds obtained from? I have regular java.util.Date objects, I can get the time in milliseconds (i.e the unix timestamp), how = would I convert that to microseconds?

=C2=A0

On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar <ali.rac200@gmail.com<= /a>> wrote:

Thanks Peter, that's interesting. I didn't k= now of that option.

=C2=A0

If updates don't create tombstones (and i'm = already taking pains to ensure no nulls are present in queries), then is th= ere no cost to just submitting an update for everything regardless of whether lastModified has changed?

=C2=A0

Thanks.

=C2=A0

On Wed, May 13, 2015 at 3:38 PM, Peer, Oded <Oded.Peer@rsa.com>= ; wrote:

You can use the =E2=80=9C= last modified=E2=80=9D value as the TIMESTAMP for your UPDATE operation.

This way the values will = only be updated if lastModified date > the lastModified you have in the DB.

=C2=A0

Updates to values don=E2= =80=99t create tombstones. Only deletes (either by executing delete, insert= ing a null value or by setting a TTL) create tombstones.

=C2=A0

=C2=A0

From: Ali Akht= ar [mailto:ali.ra= c200@gmail.com]
Sent: Wednesday, May 13, 2015 1:27 PM
To: u= ser@cassandra.apache.org
Subject: Updating only modified records (where lastModified < cur= rent date)

=C2=A0

I'm running some ETL jobs, where the pattern is = the following:

=C2=A0

1- Get some records from an external API,<= /u>

=C2=A0

2- For each record, see if its lastModified date >= ; the lastModified i have in db (or if I don't have that record in db)<= u>

=C2=A0

3- If lastModified < dbLastModified, the item was= n't changed, ignore it. Otherwise, run an update query and update that = record.

=C2=A0

(It is rare for existing records to get updated, so = I'm not that concerned about tombstones).

=C2=A0

The problem however is, since I have to query each r= ecord's lastModified, one at a time, that's adding a major bottlene= ck to my job.

=C2=A0

E.g if I have 6k records, I have to run a total of 6= k 'select lastModified from myTable where id =3D ?' queries.=

=C2=A0

Is there a better way, am I doing anything wrong, et= c? Any suggestions would be appreciated.

=C2=A0

Thanks.

=C2=A0

=C2=A0

=C2=A0

=C2=A0


--001a113a6334c5b8ac0515f58726--