cassandra-commits mailing list archives

From "Robert Stupp (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6710) Support union types
Date Mon, 06 Jul 2015 18:06:04 GMT


Robert Stupp commented on CASSANDRA-6710:

I’ve just prepared two proposals for a _union_ type in C*. There is no code yet -
just thoughts. The proposals differ in _how_ a union is declared as a column type - i.e.
with or without declaring the possible component-types up-front. Both have their own charm,
but I think the downsides of the 2nd variant are too dangerous.

h3. General

* A union must occupy exactly one cell/atom (i.e. no splitting into one cell for the union's type
and another cell for its value). That way there’s no need to special-case the _null_ value for a _union_
- a union _is_ null or _is not_ null - there is no state like _union is present but its value is null_.
* Can be used in primary key columns (like current collection and tuple/user types can)
* Although unions can be “emulated” using a tuple type, using a tuple would violate the
“contract” of a union (or _Either_ type - see description above). IMO this justifies adding
a _union_ type to C*
* Must comply with CASSANDRA-6717 and should be tackled after 6717 has landed
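
To illustrate the “contract” point above: a tuple such as {{tuple<int, text>}} can hold both fields, or neither, while a union holds exactly one alternative. The following is only a sketch of that contract in plain Java; the class name {{Either}} and its methods are hypothetical and not part of C* or any driver.

```java
import java.util.Objects;
import java.util.function.Function;

// Hypothetical Either-style holder: exactly one of the two sides is ever set.
final class Either<L, R> {
    private final L left;
    private final R right;
    private final boolean isLeft;

    private Either(L left, R right, boolean isLeft) {
        this.left = left;
        this.right = right;
        this.isLeft = isLeft;
    }

    // The only ways to build a value: pick one side, never both, never none.
    static <L, R> Either<L, R> left(L value) {
        return new Either<>(Objects.requireNonNull(value), null, true);
    }

    static <L, R> Either<L, R> right(R value) {
        return new Either<>(null, Objects.requireNonNull(value), false);
    }

    boolean isLeft() {
        return isLeft;
    }

    // Forces the caller to handle both alternatives.
    <T> T fold(Function<? super L, ? extends T> onLeft,
               Function<? super R, ? extends T> onRight) {
        return isLeft ? onLeft.apply(left) : onRight.apply(right);
    }
}
```

With this shape, {{Either.<Integer, String>left(7)}} can only ever be read back as the int alternative, which is the guarantee a tuple-based emulation cannot give.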

h3. Approach 1 - predefined set of types in union

This approach declares the possible types in the union up-front. From a data-modeling point-of-view
it is clear what _can_ be in that union. It should also help with mapping _union_ to _Either_
types in functional programming languages like Scala and therefore Spark.

CREATE TYPE bar ( a text, b int );
CREATE TABLE foo (
  pk int PRIMARY KEY,
  my_union union<int, bigint, timeuuid, text, frozen<bar>, frozen<set<bar>>>
);

The schema definition would contain the (ordered) list of possible _component-types_ per column
in a table declared using a _union_. That way all _component-types_ are indexed and can be
referenced from within any union’s value. Serialization of the union type includes an _index_
to a union’s declared component-type. By using a single byte, unions with up to 128 (0-based,
signed byte) components are theoretically possible - but honestly only a handful would be
relevant in practice.

The serialized format for a cell/atom would look like this:

| {{\[byte\]}} | component-index | references the n-th _component-type_ (0-based) in the declaration of the union in the column or the containing table |
| {{\[bytes\]}} | data | serialized representation of the type - no need to handle nulls |
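
A minimal sketch of how such a cell could be framed, assuming the value has already been serialized by its component type. The class {{IndexedUnionCell}} and its method names are made up for illustration; this is not existing C* code.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Cassandra: frames a union value as
// [component-index byte][serialized value bytes].
final class IndexedUnionCell {
    private IndexedUnionCell() {}

    // Prefix the already-serialized value with its component index.
    static ByteBuffer encode(int componentIndex, ByteBuffer value) {
        if (componentIndex < 0 || componentIndex > Byte.MAX_VALUE) {
            throw new IllegalArgumentException("component index out of range: " + componentIndex);
        }
        ByteBuffer cell = ByteBuffer.allocate(1 + value.remaining());
        cell.put((byte) componentIndex);
        cell.put(value.duplicate()); // duplicate() leaves the caller's position untouched
        cell.flip();
        return cell;
    }

    // Read the component index without consuming the cell.
    static int componentIndex(ByteBuffer cell) {
        return cell.get(cell.position());
    }

    // Return a view of the value bytes that follow the index byte.
    static ByteBuffer value(ByteBuffer cell) {
        ByteBuffer view = cell.duplicate();
        view.position(view.position() + 1);
        return view.slice();
    }
}
```

For the example declaration above, a {{text}} value would be written with index 3, so the 5 UTF-8 bytes of "hello" become a 6-byte cell.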

*Optional*: {{ALTER TABLE foo ALTER my_union union<…>}} can *add* additional types
to a union, but never remove one. Whether to implement this is more a question of _if_
we should support it at all, which falls in the area of _data modeling best-practices_. I tend
not to implement this, to be consistent with what’s possible with a tuple.
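
If the optional {{ALTER}} were implemented, existing cells would still carry their old component index, so existing components would have to keep their positions. One possible check, assuming new types may only be appended to the end of the ordered list (that restriction is my assumption; the proposal only says types can be added but never removed). The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical validation for altering a union's component list.
final class UnionAlterCheck {
    private UnionAlterCheck() {}

    // True if every existing component keeps its index, i.e. the current
    // declaration is a prefix of the proposed one.
    static boolean isAppendOnly(List<String> current, List<String> proposed) {
        return proposed.size() >= current.size()
            && proposed.subList(0, current.size()).equals(current);
    }

    public static void main(String[] args) {
        List<String> current = Arrays.asList("int", "bigint", "text");
        // Appending a type keeps all existing indexes valid.
        System.out.println(isAppendOnly(current, Arrays.asList("int", "bigint", "text", "timeuuid")));
        // Removing a type shifts or invalidates indexes already written.
        System.out.println(isAppendOnly(current, Arrays.asList("int", "text")));
        // Reordering changes the meaning of indexes already written.
        System.out.println(isAppendOnly(current, Arrays.asList("bigint", "int", "text", "timeuuid")));
    }
}
```

Only the first call is accepted; the other two would change what an already-written index refers to.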

Pros:
* Just one byte overhead compared to any _raw_ type.
* Has a “strong” reference to contained UDTs (see alternative 2 below), as a “usual”
column has. This ensures schema integrity and prevents serialization errors (see alternative
2).

Cons:
* Only a predefined, but extensible, set of types can be used. Whether that is a drawback
depends on one’s personal preference.

h3. Approach 2 - union with _any_ type

This alternative approach gives complete freedom of which types a union may contain during
its whole lifetime. So it is completely contrary to what a C _union_ or an _Either_ does or
should do. It also implies some major downsides wrt UDTs.

CREATE TYPE bar ( a text, b int );
CREATE TABLE foo (
  pk int PRIMARY KEY,
  my_union union );

The serialized format for a cell/atom would look like this:

| {{\[string\]}} | type | cql3 type name |
| {{\[bytes\]}} | data | serialized representation of the type - no need to handle nulls |
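
A rough sketch of this framing, assuming {{\[string\]}} is encoded as in the native protocol (an unsigned 2-byte length followed by that many UTF-8 bytes). That encoding choice, the class {{NamedUnionCell}} and its methods are assumptions for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Cassandra: frames a union value as
// [2-byte length][UTF-8 CQL type name][serialized value bytes].
final class NamedUnionCell {
    private NamedUnionCell() {}

    static ByteBuffer encode(String cqlTypeName, ByteBuffer value) {
        byte[] name = cqlTypeName.getBytes(StandardCharsets.UTF_8);
        if (name.length > 0xFFFF) {
            throw new IllegalArgumentException("type name too long");
        }
        ByteBuffer cell = ByteBuffer.allocate(2 + name.length + value.remaining());
        cell.putShort((short) name.length);
        cell.put(name);
        cell.put(value.duplicate()); // leave the caller's position untouched
        cell.flip();
        return cell;
    }

    // Decode the type name without consuming the cell.
    static String typeName(ByteBuffer cell) {
        ByteBuffer view = cell.duplicate();
        int length = view.getShort() & 0xFFFF;
        byte[] name = new byte[length];
        view.get(name);
        return new String(name, StandardCharsets.UTF_8);
    }

    // Return a view of the value bytes that follow the type name.
    static ByteBuffer value(ByteBuffer cell) {
        ByteBuffer view = cell.duplicate();
        int length = view.getShort() & 0xFFFF;
        view.position(view.position() + length);
        return view.slice();
    }
}
```

Under these assumptions a 4-byte {{int}} value becomes a 9-byte cell (2 + 3 + 4), compared to 5 bytes with a one-byte component index, and the overhead grows with the length of the type name.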

Pros:
* Very flexible in _which_ types can be used.

Cons:
* Huge serialization overhead, since the actual type must be serialized with the value. This
might be reduced by using something similar to what Java does for type signatures - i.e. using
{{t}} for {{timeuuid}} and {{[;}} for a UDT.
* UDTs are not strongly referenced. Creating a UDT, using it in a union, then dropping and recreating
the UDT with the same name but a different signature would likely cause serialization exceptions.
* Fits more into the “schema-less” area that we want people to avoid.

h3. Native Protocol

Requires changes to the native protocol: data type serialization, schema-change notifications,
and schema-change result messages.

h3. Java Driver

Proposal for the Java Driver (non-binding, of course - incomplete pseudo-code):

import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;
import java.util.Set;
import com.datastax.driver.core.DataType;
import com.datastax.driver.core.TupleType;
import com.datastax.driver.core.TupleValue;
import com.datastax.driver.core.UDTValue;
import com.datastax.driver.core.UserType;

// Written as an interface so that the body-less signatures are valid Java.
public interface UnionValue {
  /* typed getters - read the value as the given type */
  int getInt();
  String getString();
  /* more primitives */
  UDTValue getUDTValue(UserType userType);
  TupleValue getTupleValue(TupleType tupleType);
  <E> Set<E> getSet(Class<E> elementType);
  <E> List<E> getList(Class<E> elementType);
  <K,V> Map<K,V> getMap(Class<K> keyType, Class<V> valueType);
  /* low-level - actual type and raw serialized bytes */
  DataType getType();
  ByteBuffer getRaw();

  /* typed setters - replace the value and its actual type */
  void setInt(int v);
  void setString(String v);
  /* more primitives */
  void setUDTValue(UDTValue udtValue);
  void setTupleValue(TupleValue tupleValue);
  <E> void setSet(Class<E> elementType, Set<E> set);
  <E> void setList(Class<E> elementType, List<E> list);
  <K,V> void setMap(Class<K> keyType, Class<V> valueType, Map<K,V> map);
  /* low-level */
  void setRaw(DataType type, ByteBuffer raw);
}

h3. cqlsh, Python Driver

The Python Driver obviously needs metadata, result set and statement enhancements.
_cqlsh_ must also be able to format a union value depending on its actual type - so it adds
a dynamic indirection to {{cqlshlib.format_by_type}}, besides syntax/completion enhancements.

> Support union types
> -------------------
>                 Key: CASSANDRA-6710
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API, Core
>            Reporter: Tupshin Harper
>            Priority: Minor
>              Labels: ponies
>             Fix For: 3.x
> I sometimes find myself wanting to abuse Cassandra datatypes when I want to interleave two different types in the same column.
> An example is in CASSANDRA-6167 where an approach is to tag what would normally be a numeric field with text indicating that it is special in some ways.
> A more elegant approach would be to be able to explicitly define disjoint unions in the style of Haskell's and Scala's Either types.

This message was sent by Atlassian JIRA
