From: Kim Liu
Subject: Re: Using User Defined Functions in UPDATE queries
Date: Fri, 11 Mar 2016 16:09:22 GMT
Just for the sake of clarification, then, what is the use-case for having UDFs in an UPDATE?

If they cannot read data from the data store, then all of the parameters to the UDF must be
supplied by the client, correct?

If the client has all the parameters, the client could perform the equivalent of the UDF on
the client side, first, then send the results to the server, instead of pushing the computation
work onto the server.  So I am curious as to what one is supposed to use a UDF in an UPDATE for.

A long-winded explanation of the use-case I was trying to solve with UPDATE UDFs follows, for the
morbidly curious.

That morbidly curious, huh?

The scenario is, roughly, that the application receives a set of data which is broken up over,
say, four messages (A,B,C,D).  However, the messages can arrive in any order, possibly with
duplicates, and the data set is not complete until all four messages are received.  There
are multiple message receivers in order to scale to the volume of messages coming in, so each
of the four messages per data set could arrive at any receiver (in any chronological pattern),
and each receiving station would then insert the partial data into Cassandra.

I looked at the Cassandra SET implementation, thinking that I could just add ‘A’, ‘B’,
‘C’, ‘D’ (or 1,2,3,4) to a set with a secondary index.  Then I could periodically search for
rows where the set had all four elements, to spot which rows had a complete data set ready for processing.
However, there does not appear to be an equality check for SETs.  (Adding elements to a set
is another place where UPDATE appears to allow for the “x = x <operator> <data>”
pattern which added to my confusion about using a UDF in the UPDATE.)

So instead of using sets, the idea was to have a UDF perform a bit-wise OR operation.  Roughly:
  CREATE FUNCTION bitwise_or(a int, b int)
    CALLED ON NULL INPUT
    RETURNS int
    LANGUAGE java
    AS 'return Integer.valueOf((a == null ? 0 : a)|(b == null ? 0 : b));';
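
For illustration only, here is a minimal sketch of the same null-safe OR as plain Java, runnable outside Cassandra. The class name BitwiseOrSketch and method name bitwiseOr are hypothetical; only the expression itself comes from the UDF body above.

```java
public class BitwiseOrSketch {
    // Mirrors the UDF body: a null argument is treated as 0 before OR-ing.
    // (Hypothetical helper; not part of any Cassandra API.)
    static Integer bitwiseOr(Integer a, Integer b) {
        return Integer.valueOf((a == null ? 0 : a) | (b == null ? 0 : b));
    }

    public static void main(String[] args) {
        System.out.println(bitwiseOr(null, 2)); // 2: column not yet set
        System.out.println(bitwiseOr(2, 1));    // 3: second segment adds a bit
        System.out.println(bitwiseOr(3, 2));    // 3: a duplicate segment changes nothing
    }
}
```

The null handling matters because the column would be unset when the first segment for a data set arrives.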

Then as each message segment came in, I had intended, roughly:
  UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,2), data2=… ;
  UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,1), data1=… ;
  UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,8), data4=… ;
  UPDATE MessageData SET messageComplete = bitwise_or(messageComplete,4), data3=… ;

Then, with a secondary index on ‘messageComplete’, periodically scrape out all rows where
messageComplete was equal to 15.  (At most, sixteen unique values in the secondary index.)
 (And use a TTL to expire messages that did not eventually complete, etc.  Boilerplate infrastructure,
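
The appeal of the bitmask is that OR-ing is order-independent and idempotent, so out-of-order and duplicate segments still converge on 15. A rough sketch of that property in plain Java follows; the names CompletionMaskSketch, accumulate and isComplete are hypothetical, and the flag values 1, 2, 4, 8 are the ones used in the UPDATE statements above.

```java
import java.util.Arrays;
import java.util.List;

public class CompletionMaskSketch {
    // All four segment flags set: 1 | 2 | 4 | 8 == 15.
    static final int COMPLETE = 1 | 2 | 4 | 8;

    // OR together the flags of every segment received so far, in arrival order.
    static int accumulate(List<Integer> flags) {
        int mask = 0;
        for (int flag : flags) {
            mask |= flag;
        }
        return mask;
    }

    static boolean isComplete(int mask) {
        return mask == COMPLETE;
    }

    public static void main(String[] args) {
        // Segments arriving out of order still complete the set.
        System.out.println(isComplete(accumulate(Arrays.asList(2, 1, 8, 4))));    // true
        // Duplicates do not advance the mask; segment 3 (flag 4) is still missing.
        System.out.println(isComplete(accumulate(Arrays.asList(2, 2, 1, 8, 1)))); // false
    }
}
```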

This was based upon my incorrect assumption about UPDATE UDFs, since this looked like an optimal
way to avoid having all the clients perform read-update patterns and worrying about the clients
stepping on each other's data, as well as handling cases where duplicate messages were received
by different receivers.  So it’s starting to look like I might need to use something else
to perform the correlation between messages.


From: Sylvain Lebresne
Date: Friday, March 11, 2016 at 00:35
Subject: Re: Using User Defined Functions in UPDATE queries

UDFs are usable in UPDATE statements, as actually trying them shows; it's just the documented
grammar that needs fixing.

But as far as doing something like:
  UPDATE test_table SET data=max_int(data,5) WHERE idx='abc';
this is indeed *not* supported and likely never will be. One big pillar of C* design is that
normal writes like this don't do a read-before-write, both for performance and because of
consistency constraints, so we can't have an update depend on the previous value in any way.
I'll note that this may make UDFs useless for you, and if so, I'm sorry, but you just can't
use UDFs in C* for that, and you'd have to do a manual read-before-write client side to achieve that.
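
As a rough illustration of what a client-side read-before-write looks like, the sketch below uses an in-memory map as a stand-in for test_table; it does not use a Cassandra driver, and the names ReadBeforeWriteSketch and updateMax are hypothetical. The compare-and-replace step only mimics the idea of guarding the write against a concurrent change (in Cassandra that would presumably mean a conditional UPDATE ... IF, which is an assumption here, not something the message above prescribes).

```java
import java.util.concurrent.ConcurrentHashMap;

public class ReadBeforeWriteSketch {
    // Hypothetical in-memory stand-in for test_table: idx -> data.
    static final ConcurrentHashMap<String, Integer> table = new ConcurrentHashMap<>();

    // Read the current value, compute max(current, candidate) on the client,
    // then write it back only if nobody changed the row in between.
    static int updateMax(String idx, int candidate) {
        while (true) {
            Integer current = table.get(idx); // the read
            if (current == null) {
                // No row yet: insert only if it is still absent.
                if (table.putIfAbsent(idx, candidate) == null) {
                    return candidate;
                }
            } else {
                int next = Math.max(current, candidate);
                if (next == current) {
                    return current; // nothing to write
                }
                // The write, conditional on the value we read still being there.
                if (table.replace(idx, current, next)) {
                    return next;
                }
            }
            // Another writer got in first; re-read and retry.
        }
    }

    public static void main(String[] args) {
        updateMax("abc", 3);
        updateMax("abc", 5);
        updateMax("abc", 4);
        System.out.println(table.get("abc")); // 5
    }
}
```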

For the sake of avoiding confusion, I will note that we do allow:
  UPDATE test_table SET c = c + 1 WHERE idx='abc';
if c is a counter, but that's a very special case. Counters have a completely separate path
and implementation and do have a read-before-write (and are slower than normal update as a

