cassandra-commits mailing list archives

From "Robert Stupp (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-7395) Support for pure user-defined functions (UDF)
Date Wed, 09 Jul 2014 14:29:05 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14056291#comment-14056291 ]

Robert Stupp commented on CASSANDRA-7395:
-----------------------------------------

I like the approach of defining (and coding, as proposed in CASSANDRA-7526) UDFs directly in CQL,
although it requires adding UDFs to the system keyspace and implicitly requires schema agreement,
as for tables, indexes, UDTs, etc.
And if we agree that CASSANDRA-7526 is the right way to do it, then we must also agree that Java
8 is required for C* 3.0 (except for the "pure Java" idea below).

Using something like {{CREATE FUNCTION sum(a bigint, b bigint) AS ( return a + b; )}} is much
easier to understand and to maintain than {{AS foo.bar.Class.method}}. Bundles could be implemented
like this:
{noformat}
CREATE BUNDLE Math AS (
  FUNCTION sum(a bigint, b bigint) {
    return a + b;
  }
);
{noformat}
But instead of using Nashorn as the first step, it would be possible to use "plain" Java
code via [Apache BCEL|https://commons.apache.org/proper/commons-bcel/], which does not have
the Java 8 requirement. Adding the language as a parameter could look like
{{FUNCTION sum(a bigint, b bigint) AS JAVA ...}} or {{AS JAVASCRIPT}} or Groovy or whatever.

The _deterministic_ option was intended for using UDFs in functional indexes: functional
indexes require deterministic functions, whereas "normal" execution does not.
So I'd like to keep this flag even in the {{CREATE FUNCTION}} or {{CREATE BUNDLE ... FUNCTION}}
syntax, but with a default (either deterministic or non-deterministic) assumed when it is omitted.

To sum up, a bundle in CQL syntax using BCEL could look like this:
{noformat}
CREATE OR UPDATE BUNDLE MyUDFs (
    FUNCTION double sin(input double) AS JAVA {
        return input == null ? null : Math.sin(input);
    }

    FUNCTION float sin(input float) AS JAVA {
        // Math.sin returns a double, so narrow explicitly for the float overload
        return input == null ? null : (float) Math.sin(input);
    }

    NON DETERMINISTIC FUNCTION double random() AS JAVA {
        return Math.random();
    }
)
{noformat}

But we should keep some "backdoor" to pass the raw blob to a UDF - {{fooToBlob}} sounds straightforward,
if it's cheap. If it's not cheap, it is still possible, and if there is demand we can add a
special "raw" wildcard type for UDF parameters later.

UDFs could be held in a table:
{noformat}
CREATE TABLE system.user_functions (
   bundle       text,       -- bundle name
   signature    text,       -- function name + argument types ; might be a MD5 hash of these
   name         text,       -- function name
   arguments    list<text>, -- list of CQL argument types
   return_type  text,       -- CQL return type
   language     text,       -- programming language
   body         text,       -- code
   PRIMARY KEY ( ( bundle ), signature )
);
{noformat}
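
The {{signature}} column above is described as "function name + argument types", possibly as an MD5 hash. A minimal Java sketch of how such a key might be derived follows; the class name {{UdfSignature}}, the {{name(type,type)}} layout and the hex encoding are assumptions made for illustration, not something this ticket specifies.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical helper: derives the clustering key for system.user_functions.
final class UdfSignature {

    // Plain form: function name followed by the CQL argument types.
    // The "name(type,type)" layout is an assumption for this sketch.
    static String plain(String name, List<String> argTypes) {
        return name + "(" + String.join(",", argTypes) + ")";
    }

    // Hashed form: lower-case hex MD5 of the plain signature.
    static String md5Hex(String name, List<String> argTypes) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(plain(name, argTypes).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be available on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

With this layout the two {{sin}} overloads from the bundle example would get distinct keys, because their argument types differ.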

Altogether, this approach does not expose internals to UDFs. Using/porting {{DataType}} + {{TypeCodec}}
+ {{CassandraTypeParser}} from the Java Driver to parse "complex" CQL types is not a big deal,
and primitive types can easily be parsed using {{CQL3Type.Native.valueOf(parsedTypeDef.toUpperCase())}}.
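
To illustrate the {{valueOf(...toUpperCase())}} idea for primitive types, here is a small self-contained Java sketch. The enum {{NativeType}} is a hypothetical stand-in for {{CQL3Type.Native}} with only a few constants, and {{NativeTypeParser}} is an invented name; the real enum lives inside Cassandra and is not used here.

```java
import java.util.Locale;

// Hypothetical sketch: maps a CQL primitive type name to an enum constant.
final class NativeTypeParser {

    // Stand-in for CQL3Type.Native; deliberately incomplete.
    enum NativeType { BIGINT, BOOLEAN, DOUBLE, FLOAT, INT, TEXT }

    // Throws IllegalArgumentException for anything that is not a listed
    // primitive name, e.g. collection types such as list<text>.
    static NativeType parse(String parsedTypeDef) {
        return NativeType.valueOf(parsedTypeDef.trim().toUpperCase(Locale.ROOT));
    }
}
```

Here {{NativeTypeParser.parse("bigint")}} yields {{NativeType.BIGINT}}, while a "complex" type such as {{list<text>}} is rejected and would have to go through a separate parser.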

As a "marketing bullet list" :
* pure CQL functionality
* no C* internals exposed
* support for "pure Java" plus scripting languages
* support for the raw representation of types (using {{fooToBlob}})
* no periodic polling of filesystem or system tables
* UDFs distributed "transparently" using schema agreement
* no tooling necessary - cqlsh and everything that supports CQL is enough
* UDF development help could be integrated, for example, into "DevCenter", which would itself compile
a UDF bundle and allow testing / execution of individual functions - since it's based on Eclipse,
it might even be possible to "debug" UDFs in Java and in Nashorn-supported scripting languages
- but that's stuff for another ticket...
* Access rules can be enforced using Java {{SecureClassLoader}} (UDF invocation surrounded
with {{Thread.setContextClassLoader(...)}})
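
The "invocation surrounded with {{Thread.setContextClassLoader(...)}}" point from the list above could be sketched as follows. {{UdfInvoker}} and {{callWithContextLoader}} are hypothetical names, and the sketch shows only the set/restore pattern around the call; it does not implement any access rules by itself.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: runs a UDF body with a dedicated context class loader.
final class UdfInvoker {

    static <T> T callWithContextLoader(ClassLoader udfLoader, Callable<T> udf) throws Exception {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(udfLoader);
        try {
            return udf.call();
        } finally {
            // Always restore the caller's loader, even if the UDF throws.
            current.setContextClassLoader(previous);
        }
    }
}
```

Note that {{setContextClassLoader}} is an instance method, so it is called on {{Thread.currentThread()}}.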

Drawbacks:
* no official support for using external code
* cluster schema agreement on UDFs necessary
* changes to UDF bundles force compilation on each node - but that should not be a big issue
since UDFs should be small and efficient - they are not "full blown libraries"

I'm still not sure whether prepared statements must be invalidated if the bundle changes.
As long as a UDF with the same signature exists execution can continue - and if the bundle/function
is removed, execution will fail (which is ok).
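
One way to picture the "continue as long as the signature exists" behaviour is a lookup by signature at execution time. This is a sketch under assumptions: {{UdfRegistry}}, its methods and the use of {{Function<Object[], Object>}} as the compiled form are all hypothetical and not part of any patch on this ticket.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: statements hold only a signature and resolve it per execution.
final class UdfRegistry {

    private final Map<String, Function<Object[], Object>> bySignature = new ConcurrentHashMap<>();

    // Called when a bundle is created or updated; replaces any previous body.
    void put(String signature, Function<Object[], Object> impl) {
        bySignature.put(signature, impl);
    }

    // Called when a bundle or function is dropped.
    void remove(String signature) {
        bySignature.remove(signature);
    }

    // Execution fails only if no function with this signature exists any more.
    Object execute(String signature, Object... args) {
        Function<Object[], Object> impl = bySignature.get(signature);
        if (impl == null) {
            throw new IllegalStateException("no UDF with signature " + signature);
        }
        return impl.apply(args);
    }
}
```

Under this design an updated bundle takes effect on the next execution without invalidating the statement, which is one possible answer to the question above rather than a settled decision.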

Yes - I really like the "pure CQL" idea - simple to understand - easy for users to start with
- explanation would just need two bullet points on a slide. I think it's worth the BCEL and
schema agreement effort.

> Support for pure user-defined functions (UDF)
> ---------------------------------------------
>
>                 Key: CASSANDRA-7395
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7395
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>              Labels: cql
>             Fix For: 3.0
>
>         Attachments: 7395-v2.diff, 7395.diff
>
>
> We have some tickets for various aspects of UDF (CASSANDRA-4914, CASSANDRA-5970, CASSANDRA-4998)
> but they all suffer from various degrees of ocean-boiling.
> Let's start with something simple: allowing pure user-defined functions in the SELECT
> clause of a CQL query.  That's it.
> By "pure" I mean, must depend only on the input parameters.  No side effects.  No exposure
> to C* internals.  Column values in, result out.  http://en.wikipedia.org/wiki/Pure_function



--
This message was sent by Atlassian JIRA
(v6.2#6252)
