cassandra-user mailing list archives

From Robert Wille <rwi...@fold3.com>
Subject Re: Updating only modified records (where lastModified < current date)
Date Wed, 13 May 2015 13:15:24 GMT
You probably shouldn’t use batch updates. Your records are probably unrelated to each other,
and therefore there really is no reason to use batches. Use asynchronous queries to improve
performance. executeAsync() is your friend.

A common misconception is that batches will improve performance. They don’t. Mostly they
just increase the load on your cluster.

In my project, I have written a collection of classes that help me manage asynchronous queries.
They aren’t complicated and didn’t take very long to write, but they take away most of
the pain that occurs when you need to execute a whole bunch of asynchronous queries, and want
to meter them out, wait for them to complete, etc. I probably execute 75% of my queries asynchronously.
It's relatively painless.

On May 13, 2015, at 6:51 AM, Ali Akhtar <ali.rac200@gmail.com> wrote:

Can lightweight txns be used in a batch update?

On Wed, May 13, 2015 at 5:48 PM, Ali Akhtar <ali.rac200@gmail.com> wrote:
The 6k is only the starting value; it's expected to scale up to ~200 million records.

On Wed, May 13, 2015 at 5:44 PM, Robert Wille <rwille@fold3.com> wrote:
You could use lightweight transactions to update only if the record is newer. It doesn’t
avoid the read, it just happens under the covers, so it’s not really going to be faster
compared to a read-before-write pattern (which is an anti-pattern, BTW). It is probably the
easiest way to avoid getting a whole bunch of copies of each record.

But even with a read-before-write pattern, I don’t understand why you are worried about
6K records per hour. That’s nothing. You’re probably looking at several milliseconds to
do the read and write for each record (depending on your storage, RF and CL), so you’re
probably looking at under a minute to do 6K records. If you do them in parallel, you’re
probably looking at several seconds. I don’t get why something that probably takes less
than a minute that is done once an hour is a problem.

BTW, I wouldn’t do all 6K in parallel. I’d use some kind of limiter (e.g. a semaphore)
to ensure that you don’t execute more than X queries at a time.
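A minimal sketch of such a limiter, assuming the async call returns a CompletableFuture. The asyncCall parameter is a hypothetical stand-in for whatever wraps your driver's executeAsync(); the class and method names are made up for illustration and are not part of any driver API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

class ThrottledAsync {
    // Starts one async call per item, but never lets more than maxInFlight
    // calls be outstanding at once. asyncCall is a hypothetical stand-in for
    // a wrapper around the driver's executeAsync().
    static <T, R> List<R> runAll(List<T> items, int maxInFlight,
                                 Function<T, CompletableFuture<R>> asyncCall) {
        Semaphore permits = new Semaphore(maxInFlight);
        List<CompletableFuture<R>> futures = new ArrayList<>();
        for (T item : items) {
            // Blocks here while maxInFlight calls are still running.
            permits.acquireUninterruptibly();
            CompletableFuture<R> future;
            try {
                future = asyncCall.apply(item);
            } catch (RuntimeException e) {
                permits.release(); // the call never started, give the permit back
                throw e;
            }
            // Return the permit when the call finishes, whether it succeeded or failed.
            future.whenComplete((result, error) -> permits.release());
            futures.add(future);
        }
        // Wait for everything to complete; join() rethrows the first failure it meets.
        List<R> results = new ArrayList<>();
        for (CompletableFuture<R> future : futures) {
            results.add(future.join());
        }
        return results;
    }
}
```

Something like ThrottledAsync.runAll(records, 64, record -> updateAsync(record)) would then keep at most 64 queries in flight, where updateAsync is your own method and 64 is an arbitrary value for X.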

Robert

On May 13, 2015, at 6:20 AM, Ali Akhtar <ali.rac200@gmail.com> wrote:

But your previous email talked about when T1 is different:

> Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then you store
V’ with timestamp T1.

What if you issue an update twice, but with the same timestamp? E.g if you ran:

UPDATE .... USING TIMESTAMP 10000000 SET .... WHERE foo=bar

and 1 hour later, you ran exactly the same query again. In this case, the value of T is the
same for both queries. Would that still cause multiple values to be stored?

On Wed, May 13, 2015 at 5:17 PM, Peer, Oded <Oded.Peer@rsa.com> wrote:
It will cause an overhead (compaction and read) as I described in the previous email.

From: Ali Akhtar [mailto:ali.rac200@gmail.com]
Sent: Wednesday, May 13, 2015 3:13 PM
To: user@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified < current date)


> I don’t understand the ETL use case and its relevance here. Can you provide more details?

Basically, every 1 hour a job runs which queries an external API and gets some records. Then,
I want to take only new or updated records, and insert / update them in cassandra. For records
that are already in cassandra and aren't modified, I want to ignore them.

Each record returns a lastModified datetime, I want to use that to determine whether a record
was changed or not (if it was, it'd be updated, if not, it'd be ignored).

The issue was, I'm having to do a 'select lastModified from table where id = ?' query for
every record, in order to determine if db lastModified < api lastModified or not. I was
wondering if there was a way to avoid that.

If I use 'USING TIMESTAMP', would subsequent updates where lastModified is a value that was
previously used, still create that overhead, or will they be ignored?

E.g if I issued an update where TIMESTAMP is X, then 1 hour later I issued another update
where TIMESTAMP is still X, will that 2nd update essentially get ignored, or will it cause
any overhead?

On Wed, May 13, 2015 at 5:02 PM, Peer, Oded <Oded.Peer@rsa.com> wrote:
USING TIMESTAMP doesn’t avoid compaction overhead.
When you modify data, the value is stored along with its write timestamp.
Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then you store V’
with timestamp T1.
Now you have two values of V in the DB: <V,T2>, <V’,T1>
When you read the value of V from the DB you read both <V,T2>, <V’,T1>, Cassandra
resolves the conflict by comparing the timestamp and returns V.
Compaction will later take care and remove <V’,T1> from the DB.
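The behaviour described above can be modelled with a toy class. This is only an illustrative sketch, not Cassandra's storage code: the class and method names are made up, and it does not model how Cassandra breaks ties between equal timestamps.

```java
import java.util.ArrayList;
import java.util.List;

class LastWriteWinsCell {
    // One stored version: a value plus the timestamp it was written with.
    static final class Version {
        final String value;
        final long timestamp;

        Version(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final List<Version> versions = new ArrayList<>();

    // A write never looks at what is already stored; it just adds a version.
    void write(String value, long timestamp) {
        versions.add(new Version(value, timestamp));
    }

    // Returns the stored version with the highest timestamp, or null if empty.
    // Equal timestamps are not modelled here.
    private Version newest() {
        Version newest = null;
        for (Version v : versions) {
            if (newest == null || v.timestamp > newest.timestamp) {
                newest = v;
            }
        }
        return newest;
    }

    // A read has to consider every stored version and returns the newest value.
    String read() {
        Version newest = newest();
        return newest == null ? null : newest.value;
    }

    // Compaction throws away every version except the newest one.
    void compact() {
        Version newest = newest();
        versions.clear();
        if (newest != null) {
            versions.add(newest);
        }
    }

    int storedVersions() {
        return versions.size();
    }
}
```

Writing V with T2 and then V’ with T1 leaves two stored versions, read() still returns V, and only compact() brings the count back to one. That extra stored version is the read and compaction overhead.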

I don’t understand the ETL use case and its relevance here. Can you provide more details?

UPDATE in Cassandra updates specific rows. All of them are updated, nothing is ignored.


From: Ali Akhtar [mailto:ali.rac200@gmail.com]
Sent: Wednesday, May 13, 2015 2:43 PM
To: user@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified < current date)

It's rare for an existing record to have changes, but the ETL job runs every hour, therefore
it will send updates each time, regardless of whether there were changes or not.

(I'm assuming that USING TIMESTAMP here will avoid the compaction overhead, since that will
cause it to not run any updates unless the timestamp is actually > last update timestamp?)

Also, is there a way to get the number of rows which were updated / ignored?

On Wed, May 13, 2015 at 4:37 PM, Peer, Oded <Oded.Peer@rsa.com> wrote:
The cost of issuing an UPDATE that won’t update anything is compaction overhead. Since you
stated it’s rare for rows to be updated then the overhead should be negligible.

The easiest way to convert a milliseconds timestamp long value to microseconds is to multiply
by 1000.
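For a java.util.Date, a small sketch of that conversion could look like the following. The class name, table name (records), and column names (payload, id) are hypothetical and only there to show where the value would go.

```java
import java.util.Date;
import java.util.concurrent.TimeUnit;

class WriteTimestamps {
    // Hypothetical statement text; the table and column names are made up.
    // The first bind marker is where the microsecond timestamp goes.
    static final String UPDATE_CQL =
            "UPDATE records USING TIMESTAMP ? SET payload = ? WHERE id = ?";

    // Converts a java.util.Date to microseconds since the Unix epoch.
    // Date only has millisecond precision, so this is the same as getTime() * 1000.
    static long toMicros(Date lastModified) {
        return TimeUnit.MILLISECONDS.toMicros(lastModified.getTime());
    }
}
```

TimeUnit.MILLISECONDS.toMicros() saturates at Long.MAX_VALUE instead of overflowing, which makes no practical difference for real dates.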

From: Ali Akhtar [mailto:ali.rac200@gmail.com]
Sent: Wednesday, May 13, 2015 2:15 PM
To: user@cassandra.apache.org
Subject: Re: Updating only modified records (where lastModified < current date)

Would TimeUnit.MILLISECONDS.toMicros(  myDate.getTime() ) work for producing the microsecond
timestamp ?

On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar <ali.rac200@gmail.com> wrote:
If specifying 'using' timestamp, the docs say to provide microseconds, but where are these
microseconds obtained from? I have regular java.util.Date objects, I can get the time in milliseconds
(i.e the unix timestamp), how would I convert that to microseconds?

On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar <ali.rac200@gmail.com> wrote:
Thanks Peter, that's interesting. I didn't know of that option.

If updates don't create tombstones (and I'm already taking pains to ensure no nulls are present
in queries), then is there no cost to just submitting an update for everything regardless
of whether lastModified has changed?

Thanks.

On Wed, May 13, 2015 at 3:38 PM, Peer, Oded <Oded.Peer@rsa.com> wrote:
You can use the “last modified” value as the TIMESTAMP for your UPDATE operation.
This way the values will only be updated if lastModified date > the lastModified you have
in the DB.

Updates to values don’t create tombstones. Only deletes (either by executing delete, inserting
a null value or by setting a TTL) create tombstones.


From: Ali Akhtar [mailto:ali.rac200@gmail.com]
Sent: Wednesday, May 13, 2015 1:27 PM
To: user@cassandra.apache.org
Subject: Updating only modified records (where lastModified < current date)

I'm running some ETL jobs, where the pattern is the following:

1- Get some records from an external API,

2- For each record, see if its lastModified date > the lastModified I have in db (or if
I don't have that record in db)

3- If lastModified <= dbLastModified, the item wasn't changed, ignore it. Otherwise, run
an update query and update that record.

(It is rare for existing records to get updated, so I'm not that concerned about tombstones).

The problem however is, since I have to query each record's lastModified, one at a time, that's
adding a major bottleneck to my job.

E.g if I have 6k records, I have to run a total of 6k 'select lastModified from myTable where
id = ?' queries.

Is there a better way, am I doing anything wrong, etc? Any suggestions would be appreciated.

Thanks.










