Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of arodrime@gmail.com designates
 209.85.217.172 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <A19709CDF35A4225A4D825D8BD1D1ABB@JackKrupansky14>
References: 
 <CA+VSrLpG4aKVk+jiMMQTOQede7J0e+_b+PCmN00yXZoHWQZLVA@mail.gmail.com>
 <A19709CDF35A4225A4D825D8BD1D1ABB@JackKrupansky14>
From: Alain RODRIGUEZ <arodrime@gmail.com>
Date: Tue, 22 Jul 2014 17:29:12 +0200
Message-ID: 
 <CA+VSrLqJF9=R9D4Mx1-c69NWq+OL7skZPePRxaunPSOJ8exiSw@mail.gmail.com>
Subject: Re: JSON to Cassandra ?
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=089e013d161eed8ed204fec9e324

--089e013d161eed8ed204fec9e324
Content-Type: text/plain; charset=ISO-8859-1

Hi,

This seems to fit, though I would need to look into how these fields can be
queried and indexed. I would also need to check whether those UDTs can be
modified once created, and how they behave in this use case.

Yet 2.1 is currently in beta, and we won't switch to it immediately (even
though we could benefit from this and from the improved counters as well),
since we are using C* 1.2 and are trying out DSE 4.5. In both cases, we are
far from using 2.1. How do people usually do this without UDTs?

Thanks for the pointer though, it will probably help someday :-).
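
For the record, the '#'-separated column names in my schema below are just
the path to each leaf of the JSON. Here is a minimal Python sketch of that
flattening. The function name flatten and the "arrayN" naming for list
positions are only my own illustrative choices, not anything provided by
Cassandra or a driver:

```python
import json


def flatten(value, prefix=""):
    """Yield (path, leaf) pairs for every scalar in a parsed JSON value.

    Path segments are joined with '#'; list positions become 'array<N>'.
    Both conventions are assumptions taken from the schema in this thread.
    """
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}#{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            segment = f"array{index}"
            yield from flatten(child, f"{prefix}#{segment}" if prefix else segment)
    else:
        # Scalars keep their JSON-decoded Python type (str, int, float, bool, None).
        yield prefix, value


if __name__ == "__main__":
    message = json.loads('{"struct-id": 1, "nested-1-1": {"value-1-1-4": ["foo", "bar"]}}')
    for path, leaf in flatten(message):
        print(path, type(leaf).__name__, leaf)
```

Each (path, value) pair would then map to one typed column, which is what
makes a per-attribute count possible without parsing the whole message.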


2014-07-22 16:30 GMT+02:00 Jack Krupansky <jack@basetechnology.com>:

>   Sounds like user-defined types (UDT) in Cassandra 2.1:
> https://issues.apache.org/jira/browse/CASSANDRA-5590
>
> But... be careful to make sure that you aren't using this powerful (and
> dangerous) feature as a crutch merely to avoid disciplined data modeling.
>
> -- Jack Krupansky
>
>  *From:* Alain RODRIGUEZ <arodrime@gmail.com>
> *Sent:* Tuesday, July 22, 2014 9:56 AM
> *To:* user@cassandra.apache.org
> *Subject:* JSON to Cassandra ?
>
>  Hi guys, I know this topic has already been discussed many times, and I
> have read a lot of these discussions.
>
> Yet, I have not been able to find a good way to do what I want.
>
> We are receiving messages from our app in the form of complex, dynamic,
> nested JSON (anywhere from a few to thousands of attributes). The structure
> varies and can contain nested arrays or nested objects.
>
> Please consider this example:
>
>  JSON
>
> {
>     "struct-id": 141241321,
>     "nested-1-1": {
>         "value-1-1-1": "36d1f74d-1663-418d-8b1b-665bbb2d9ecb",
>         "value-1-1-2": 5,
>         "value-1-1-3": 0.5,
>         "value-1-1-4": ["foo", "bar", "foobar"],
>         "nested-2-1": {
>             "test-2-1-1": "whatever",
>             "test-2-1-2": 42
>         }
>     },
>     "nested-1-2": {
>         "value-1-2-1": [{
>             "id": 1,
>             "deeply-nested": {
>                 "data-1": "test",
>                 "data-2": 4023
>             }
>         },
>         {
>             "id": 2,
>             "data-3": "that's enough data"
>         }]
>     }
> }
>
> We would like to store those messages in Cassandra and then run Spark jobs
> over them. Storing each message as text (the full JSON in one column) would
> work but wouldn't be optimal: to count how many times "value-1-1-3" is
> greater than or equal to 1, I would have to read the entire JSON of every
> message. I have read a lot about people using composite columns and dynamic
> composite columns, but found no precise example. I am also aware of
> collections support, yet nested collections are not currently supported.
>
> I would like to have:
>
> - 1 column per attribute
> - typed values
> - something that would be able to parse and store any valid JSON (with
> nested arrays of JSON or whatever).
> - the most efficient model to use alongside Spark to query anything
> inside.
>
> What CQL schemas could be used to model such a data structure?
>
> What are the drawbacks of the following schema?
>
>  Cassandra
>
> CREATE TABLE test_schema (
>     "struct-id" int,
>     "nested-1-1#value-1-1-1" text,
>     "nested-1-1#value-1-1-2" int,
>     "nested-1-1#value-1-1-3" float,
>     "nested-1-1#value-1-1-4#array0" text,
>     "nested-1-1#value-1-1-4#array1" text,
>     "nested-1-1#value-1-1-4#array2" text,
>     "nested-1-1#nested-2-1#test-2-1-1" text,
>     "nested-1-1#nested-2-1#test-2-1-2" int,
>     "nested-1-2#value-1-2-1#array0#id" int,
>     "nested-1-2#value-1-2-1#array0#deeply-nested#data-1" text,
>     "nested-1-2#value-1-2-1#array0#deeply-nested#data-2" int,
>     "nested-1-2#value-1-2-1#array1#id" int,
>     "nested-1-2#value-1-2-1#array1#data-3" text,
>     PRIMARY KEY ("struct-id")
> )
>
> I could use:
>
>     "nested-1-1#value-1-1-4" list<text>,
>
> instead of:
>
>     "nested-1-1#value-1-1-4#array0" text,
>     "nested-1-1#value-1-1-4#array1" text,
>     "nested-1-1#value-1-1-4#array2" text,
>
> yet it wouldn't work here:
>
>     "nested-1-2#value-1-2-1#array0#deeply-nested#data-1" text,
>     "nested-1-2#value-1-2-1#array0#deeply-nested#data-2" int,
>     "nested-1-2#value-1-2-1#array1#id" int,
>     "nested-1-2#value-1-2-1#array1#data-3" text,
>
> since this is a nested structure inside the list.
>
>
>
> To create this schema, could we imagine the app that logs these messages
> trying to write each JSON attribute to its corresponding column and, if the
> column is missing, catching the error, creating the column, and retrying
> the write?
>
> This exception would happen only once for each new field, and would modify
> the schema.
>
> Any thoughts that would help us (and probably other people)?
>
> Alain
>
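
A rough Python sketch of the "create the column when it is missing" idea
from the end of my original message. To keep it runnable without a cluster,
it only builds the ALTER TABLE statements and compares against a set of
already-known column names instead of catching a driver error, so it is a
variation on the idea rather than the exact mechanism. The helper names and
the JSON-to-CQL type mapping are assumptions of mine:

```python
def cql_type(value):
    """Guess a CQL type from a JSON-decoded Python value (assumed mapping)."""
    # bool must be tested before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "text"


def missing_column_statements(table, known_columns, row):
    """Return one ALTER TABLE statement per column of `row` not yet known.

    `row` maps flattened column names to values; None values are skipped
    because no type can be inferred from them.
    """
    statements = []
    for name, value in row.items():
        if name in known_columns or value is None:
            continue
        quoted = name.replace('"', '""')  # escape embedded double quotes
        statements.append(f'ALTER TABLE {table} ADD "{quoted}" {cql_type(value)}')
    return statements


if __name__ == "__main__":
    known = {"struct-id"}
    row = {"struct-id": 141241321, "nested-1-1#value-1-1-3": 0.5}
    for statement in missing_column_statements("test_schema", known, row):
        print(statement)
```

The statements would still have to be executed (and the known-column set
refreshed) before retrying the write, and several writers discovering the
same new field at once would need some coordination.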

--089e013d161eed8ed204fec9e324--