cassandra-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: JSON to Cassandra ?
Date Tue, 22 Jul 2014 14:30:01 GMT
Sounds like user-defined types (UDT) in Cassandra 2.1:
https://issues.apache.org/jira/browse/CASSANDRA-5590

But... be careful to make sure that you aren’t using this powerful (and dangerous) feature
as a crutch merely to avoid disciplined data modeling.

-- Jack Krupansky

From: Alain RODRIGUEZ 
Sent: Tuesday, July 22, 2014 9:56 AM
To: user@cassandra.apache.org 
Subject: JSON to Cassandra ?

Hi guys, I know this topic has already been discussed many times, and I have read a lot of these discussions.


Yet, I have not been able to find a good way to do what I want.

We are receiving messages from our app in the form of complex, dynamic, nested JSON (anywhere from
a few to thousands of attributes). The JSON is variable and can contain nested arrays or sub-JSONs.

Please, consider this example:

JSON

{
    "struct-id": 141241321,
    "nested-1-1": {
        "value-1-1-1": "36d1f74d-1663-418d-8b1b-665bbb2d9ecb",
        "value-1-1-2": 5,
        "value-1-1-3": 0.5,
        "value-1-1-4": ["foo", "bar", "foobar"],
        "nested-2-1": {
            "test-2-1-1": "whatever",
            "test-2-1-2": 42
        }
    },
    "nested-1-2": {
        "value-1-2-1": [{
            "id": 1,
            "deeply-nested": {
                "data-1": "test",
                "data-2": 4023
            }
        },
        {
            "id": 2,
            "data-3": "that's enough data"
        }]
    }
}

We would like to store those messages in Cassandra and then run Spark jobs over them. Storing
each message as text (the full JSON in one column) would work but would not be optimal: to
count how many times "value-1-1-3" is greater than or equal to 1, I would have to read and
parse every JSON document in full. I have read a lot about people using composite columns and
dynamic composite columns, but I have found no precise example. I am also aware of collections
support, yet nested collections are not currently supported.
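To illustrate the cost described above, here is a minimal Python sketch of the count when each message is stored as a single text column. The `rows` list and the `count_at_least` helper are hypothetical stand-ins for rows read from Cassandra; the point is only that every document has to be parsed in full to inspect one field.

```python
import json

# Hypothetical rows: each holds one full JSON message in a single text column.
rows = [
    '{"struct-id": 1, "nested-1-1": {"value-1-1-3": 0.5}}',
    '{"struct-id": 2, "nested-1-1": {"value-1-1-3": 1.0}}',
    '{"struct-id": 3, "nested-1-1": {"value-1-1-3": 2.5}}',
    '{"struct-id": 4, "nested-1-2": {}}',
]

def count_at_least(raw_rows, threshold=1):
    """Count messages whose nested-1-1 / value-1-1-3 is >= threshold."""
    count = 0
    for raw in raw_rows:
        doc = json.loads(raw)  # the whole document is parsed to read one field
        value = doc.get("nested-1-1", {}).get("value-1-1-3")
        # Skip missing values and booleans (bool is a subclass of int in Python).
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            if value >= threshold:
                count += 1
    return count

matching = count_at_least(rows)
```

With one typed column per attribute, the same count would only need to read that single column rather than the whole document.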

I would like to have:

- 1 column per attribute
- typed values
- something that would be able to parse and store any valid JSON (with nested arrays of JSON
or whatever).
- the most efficient model to use alongside Spark to query anything inside.

What CQL schemas could represent such a data structure?

What are the drawbacks of the following schema?

Cassandra

-- Column names containing '-' or '#' must be double-quoted in CQL;
-- the table name uses an underscore, and strings use the text type.
CREATE TABLE test_schema (
    "struct-id" int,
    "nested-1-1#value-1-1-1" text,
    "nested-1-1#value-1-1-2" int,
    "nested-1-1#value-1-1-3" float,
    "nested-1-1#value-1-1-4#array0" text,
    "nested-1-1#value-1-1-4#array1" text,
    "nested-1-1#value-1-1-4#array2" text,
    "nested-1-1#nested-2-1#test-2-1-1" text,
    "nested-1-1#nested-2-1#test-2-1-2" int,
    "nested-1-2#value-1-2-1#array0#id" int,
    "nested-1-2#value-1-2-1#array0#deeply-nested#data-1" text,
    "nested-1-2#value-1-2-1#array0#deeply-nested#data-2" int,
    "nested-1-2#value-1-2-1#array1#id" int,
    "nested-1-2#value-1-2-1#array1#data-3" text,
    PRIMARY KEY ("struct-id")
);
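The naming convention in this schema (path segments joined with `#`, list positions written as `arrayN`) can be sketched in Python as follows. The `flatten` function and the shortened `message` are illustrative assumptions, not an existing library API; empty objects and empty arrays produce no column in this sketch.

```python
def flatten(value, prefix=""):
    """Flatten nested JSON into {column_name: scalar} using '#' and 'arrayN'."""
    columns = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}#{key}" if prefix else key
            columns.update(flatten(child, path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            path = f"{prefix}#array{index}" if prefix else f"array{index}"
            columns.update(flatten(child, path))
    else:
        # Scalar leaf: the accumulated path becomes the column name.
        columns[prefix] = value
    return columns

# Shortened version of the example message above.
message = {
    "struct-id": 141241321,
    "nested-1-1": {
        "value-1-1-4": ["foo", "bar"],
        "nested-2-1": {"test-2-1-2": 42},
    },
    "nested-1-2": {
        "value-1-2-1": [{"id": 1}, {"id": 2, "data-3": "that's enough data"}],
    },
}

columns = flatten(message)
```

Here `columns` maps, for instance, `nested-1-1#value-1-1-4#array1` to `"bar"` and `nested-1-2#value-1-2-1#array1#data-3` to `"that's enough data"`, one entry per leaf value.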

I could use:

    "nested-1-1#value-1-1-4" list<text>,


instead of:

    "nested-1-1#value-1-1-4#array0" text,
    "nested-1-1#value-1-1-4#array1" text,
    "nested-1-1#value-1-1-4#array2" text,

yet it wouldn't work here:

    "nested-1-2#value-1-2-1#array0#deeply-nested#data-1" text,
    "nested-1-2#value-1-2-1#array0#deeply-nested#data-2" int,
    "nested-1-2#value-1-2-1#array1#id" int,
    "nested-1-2#value-1-2-1#array1#data-3" text,

since this is a nested structure inside the list.



To create this schema, could the app that logs these messages try to write each JSON attribute
to its corresponding column and, if the column is missing, catch the error, create the column,
and retry the write?

This exception would occur only once for each new field, and handling it would modify the schema.
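A rough Python sketch of the "create the missing column" step follows. It is a variation on the idea above: instead of catching a driver exception, it compares the flattened message against a locally cached set of known columns and builds the `ALTER TABLE ... ADD` statements that would be needed. The helper names, the table name `test_schema`, and the type mapping (for example `bigint` for every integer, `text` as the fallback) are assumptions made for illustration; no Cassandra driver is involved.

```python
def cql_type(value):
    """Map a decoded JSON scalar to a CQL type name (assumed mapping)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "text"  # strings, and anything else, fall back to text

def missing_column_statements(table, known_columns, flat_message):
    """Return one ALTER TABLE statement per column not yet in known_columns."""
    statements = []
    for name, value in flat_message.items():
        if name not in known_columns:
            # Double-quote the identifier; embedded quotes are doubled.
            quoted = '"' + name.replace('"', '""') + '"'
            statements.append(
                f"ALTER TABLE {table} ADD {quoted} {cql_type(value)};"
            )
    return statements

# Hypothetical state: only the key column exists so far.
known = {"struct-id"}
flat = {
    "struct-id": 141241321,
    "nested-1-1#value-1-1-2": 5,
    "nested-1-1#value-1-1-3": 0.5,
}
statements = missing_column_statements("test_schema", known, flat)
```

After the statements are executed, the new names would be added to `known`, so each new field triggers a schema change only once.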

Any thoughts that would help us (and probably others)?

Alain