cassandra-commits mailing list archives

From "Robert Stupp (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-10367) Aggregate with Initial Condition fails with C* 3.0
Date Fri, 18 Sep 2015 10:30:04 GMT


Robert Stupp commented on CASSANDRA-10367:

Sure. It's all about maintaining the state of the aggregate.

The current flow for UDAs is (roughly) like this:
# create initial state variable instance, serialized (the collection in this case)
# for each row
## deserialize state variable instance (the collection in this case)
## call UDA state function with deserialized state variable and row's column value
## UDA state function modifies state variable (the collection in this case)
## store serialized state variable instance as returned by UDA state function
# for final function
## deserialize state variable instance
## call UDA final function with deserialized state variable
# return UDA final value
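
To make the per-row cost visible, here is a minimal Java sketch of the flow above. It is not Cassandra's actual code: `serialize` and `deserialize` are hypothetical stand-ins for the real type codecs, `extendList` is modelled on the `extend_list` function from the ticket, and the final function is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SerializedStateFlow {
    // Hypothetical codec: writes the element count followed by each string.
    static byte[] serialize(List<String> state) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(state.size());
            for (String s : state) {
                out.writeUTF(s);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Hypothetical codec: rebuilds a fresh, mutable list from the bytes.
    static List<String> deserialize(byte[] raw) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
            int count = in.readInt();
            List<String> state = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                state.add(in.readUTF());
            }
            return state;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // State function modelled on extend_list from the ticket.
    static List<String> extendList(List<String> s, Integer i) {
        if (i != null) s.add(String.valueOf(i));
        return s;
    }

    static List<String> aggregate(List<Integer> rows) {
        // Initial state (INITCOND) is held in serialized form.
        byte[] serializedState = serialize(new ArrayList<>());
        for (Integer value : rows) {
            List<String> state = deserialize(serializedState); // once per row
            state = extendList(state, value);
            serializedState = serialize(state);                // once per row
        }
        // No final function in this sketch: the state is the result.
        return deserialize(serializedState);
    }

    public static void main(String[] args) {
        System.out.println(aggregate(Arrays.asList(1, null, 3))); // prints [1, 3]
    }
}
```

Every row pays for one full deserialization and one full serialization of the state, so the work per row grows with the size of the collection.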

Superfluous re-serialization is addressed in CASSANDRA-9613. So the flow would then be:
# create state variable instance (non-serialized, a "real" object)
# for each row
## call state function with state variable object and row's column value
## store state variable object returned from state function
# for final function
## call final function with state variable object
# serialize state variable or final function's return value
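
As a rough illustration of the proposed flow (a sketch only, not the CASSANDRA-9613 patch), the loop below keeps the state as a live Java object between rows. The `aggregate` helper and its signature are assumptions made for this example, and the single serialization step at the end is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

class ObjectStateFlow {
    // Hypothetical aggregation loop: the state is never serialized between rows.
    static <S, V> S aggregate(S initialState, List<V> rows, BiFunction<S, V, S> stateFunction) {
        S state = initialState;
        for (V value : rows) {
            // Store whatever object the state function returns.
            state = stateFunction.apply(state, value);
        }
        return state;
    }

    // State function modelled on extend_list from the ticket.
    static List<String> extendList(List<String> s, Integer i) {
        if (i != null) s.add(String.valueOf(i));
        return s;
    }

    public static void main(String[] args) {
        List<String> initial = new ArrayList<>();
        List<String> result = aggregate(initial, Arrays.asList(1, null, 3), ObjectStateFlow::extendList);
        System.out.println(result);            // prints [1, 3]
        System.out.println(result == initial); // prints true: same object, no copies
    }
}
```

This only works if the state object handed to the state function is mutable, which is the concern raised next.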

But for unmodifiable collections, the UDA's state function has to do something like this:
public List myStateFunction(List state, String value) {
  state = new ArrayList(state); // <-- THIS ONE: copy the unmodifiable state before touching it
  return state;
}
This can become quite expensive (CPU and garbage) if the UDA's being used on a partition with
several hundred/thousand rows - especially if people use bigger maps, store more intermediate
results, etc, etc.
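
A small Java sketch of both effects, assuming the state arrives wrapped with `Collections.unmodifiableList` (an assumption for illustration; the wrapper Cassandra or the driver actually uses may differ). All class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class UnmodifiableStateDemo {
    // Mutates the state in place, like extend_list in the ticket.
    static List<String> mutatingStateFunction(List<String> state, String value) {
        state.add(value);
        return state;
    }

    // Defensive-copy variant: copies the whole state on every call.
    static List<String> copyingStateFunction(List<String> state, String value) {
        List<String> copy = new ArrayList<>(state);
        copy.add(value);
        return copy;
    }

    // Returns true if in-place mutation of an unmodifiable state fails.
    static boolean failsOnUnmodifiableState() {
        List<String> state = Collections.unmodifiableList(new ArrayList<String>());
        try {
            mutatingStateFunction(state, "x");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    // Counts how many elements the copying variant copies over `rows` rows.
    static long elementsCopied(int rows) {
        long copied = 0;
        List<String> state = Collections.unmodifiableList(new ArrayList<String>());
        for (int row = 0; row < rows; row++) {
            copied += state.size(); // elements copied by this call
            state = Collections.unmodifiableList(copyingStateFunction(state, "v" + row));
        }
        return copied;
    }

    public static void main(String[] args) {
        System.out.println(failsOnUnmodifiableState()); // prints true
        System.out.println(elementsCopied(1000));       // prints 499500
    }
}
```

For a list that grows by one element per row, the copies add up to n(n-1)/2 element copies over n rows, so the cost is quadratic in the partition size.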

OTOH this will also be true for tuples and UDTs as these always (de)serialize for every get/set,
which is also imperfect IMO (but not something to be addressed in the driver).

So, long wall of text...
*TL;DR* - Since tuples and UDTs also (de)serialize on every get/set, I'd like to address that
in CASSANDRA-9613 as well. So my plan would be: resolve this as a "duplicate" of 9613 and fix it
there for UDAs. But I'm still unsure whether returning an unmodifiable collection is a good idea
in the driver.

> Aggregate with Initial Condition fails with C* 3.0
> --------------------------------------------------
>                 Key: CASSANDRA-10367
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 3.0 branch
>            Reporter: Greg Bestland
>            Assignee: Robert Stupp
>             Fix For: 3.0.x
> I'm seeing some inconsistent behavior between C* 2.2 and 3.0 with regard to UDFs, aggregates
> and initial conditions. I have a scenario which I think is valid. It works in C* 2.2 but
> not in 3.0.
> Using the following user defined function
> {code:sql}
> CREATE OR REPLACE FUNCTION extend_list(s list<text>, i int)
>                                   CALLED ON NULL INPUT
>                                   RETURNS list<text>
>                                   LANGUAGE java AS 'if (i != null) s.add(String.valueOf(i)); return s;';
> {code}
> With the aggregate below
> {code:sql}
> CREATE AGGREGATE aggregatemetadata.test_init_cond_aggregate(int) SFUNC extend_list STYPE list<text> INITCOND [  ]
> {code}
> When I attempt to exercise the aggregate on a simple key-value table,
> {code:sql}
> SELECT test_init_cond_aggregate(v) AS list_res FROM t
> {code}
> in 2.2 it works fine and returns the aggregate.
> The exact same test run against the 3.0 branch produces the following exception from
> the server.
> {code:java}
> InvalidRequest: code=2200 [Invalid query] message="ERROR FUNCTION_FAILURE: execution of 'aggregatemetadata.extend_list[list<text>, int]' failed: java.lang.UnsupportedOperationException"
> {code}
> I've grepped through the C* logs but I couldn't find a more verbose stack trace, or any
> Robert Stupp suggested I open a ticket.
> I am able to reproduce both in the python driver manually using cql.
