Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
From: Gianluca Borello <gianluca@sysdig.com>
Date: Sun, 14 Feb 2016 14:22:20 -0800
Message-ID: 
 <CAJjpQyTS2eaCcRBVa=ZmM-hcBX5nF4ovC1enW+SFfGwvngOi7g@mail.gmail.com>
Subject: Performance issues with "many" CQL columns
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=047d7bf0d2ecc5c293052bc25755

--047d7bf0d2ecc5c293052bc25755
Content-Type: text/plain; charset=UTF-8

Hi

I've just painfully discovered a "little" detail in Cassandra: Cassandra
touches all columns on a CQL select (related issues
https://issues.apache.org/jira/browse/CASSANDRA-6586,
https://issues.apache.org/jira/browse/CASSANDRA-6588,
https://issues.apache.org/jira/browse/CASSANDRA-7085).

My data model is fairly simple: I have a bunch of "sensors" reporting a
blob of data (~10-100KB) periodically. When reading, 99% of the time I'm
interested in a subportion of that blob of data across an arbitrary period
of time. What I do is simply split those blobs of data into about 30
logical units and write them to a CQL table such as:

create table data (
    id bigint,
    ts bigint,
    column1 blob,
    column2 blob,
    column3 blob,
    ...
    column29 blob,
    column30 blob,
    primary key (id, ts)
);

id is a combination of the sensor id and a time bucket, so that the row
does not get too wide. Essentially, I thought this was a very legit data
model that helps me keep my application code very simple (because I can
work on a single table, I can write a split sensor blob in a single CQL
query, and I can read a subset of the columns very efficiently with one
single CQL query).
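
To make the key layout concrete, here is a minimal sketch of one way such an
id could be built, together with the single-column read. The 32/32-bit
packing, the one-hour bucket, the assumption that ts is in epoch seconds, and
the names make_id and single_column_query are illustrative assumptions only;
the table and column names come from the schema above.

```python
BUCKET_SECONDS = 3600  # assumed bucket width; the message does not state one


def make_id(sensor_id, ts):
    """Pack a sensor id and the time bucket of ts into one signed 64-bit value.

    Assumption: ts is in epoch seconds and the id is laid out as
    (sensor_id << 32) | bucket. The real layout is not given in the message.
    """
    if not 0 <= sensor_id < 2 ** 31:
        raise ValueError("sensor_id must fit in 31 bits")
    bucket = ts // BUCKET_SECONDS
    if not 0 <= bucket < 2 ** 32:
        raise ValueError("time bucket must fit in 32 bits")
    return (sensor_id << 32) | bucket


def single_column_query(column):
    """CQL text that reads one logical unit over a time range in one bucket."""
    return f"SELECT ts, {column} FROM data WHERE id = ? AND ts >= ? AND ts < ?"
```

With this layout, reading column7 for one sensor over a period means issuing
single_column_query("column7") once per bucket that the period touches.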

What I didn't realize is that Cassandra seems to always process all the
columns of the CQL row, even when my query asks for just one column, and
this has a dramatic effect on the performance of my reads.

I wrote a simple isolated test case where I test how long it takes to read
one *single* column in a CQL table composed of several columns (at each
iteration I add and populate 10 new columns), each filled with 1MB blobs:

10 columns: 209 ms
20 columns: 339 ms
30 columns: 510 ms
40 columns: 670 ms
50 columns: 884 ms
60 columns: 1056 ms
70 columns: 1527 ms
80 columns: 1503 ms
90 columns: 1600 ms
100 columns: 1792 ms
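
As a rough check on the trend, a least-squares line can be fitted to these
ten measurements. This is only a sketch over the numbers listed above; the
helper name linear_fit is made up for illustration.

```python
# Timings copied from the list above: (number of columns, read time in ms).
timings = [
    (10, 209), (20, 339), (30, 510), (40, 670), (50, 884),
    (60, 1056), (70, 1527), (80, 1503), (90, 1600), (100, 1792),
]


def linear_fit(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = linear_fit(timings)
print(f"{slope:.1f} ms per extra column")
```

The fitted slope is a bit under 19 ms per additional column, and since each
column holds a 1MB blob, that is roughly the cost of every extra megabyte
that the query did not ask for.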

In other words, even though the result set returned is exactly the same
across all these iterations, the response time increases linearly with the
size of the other columns, and this is really causing a lot of problems in
my application.

From reading the JIRA issues, it seems like this is considered a very minor
optimization not worth the effort of fixing, so I'm asking: is my use case
really so anomalous that the horrible performance I'm experiencing is to be
considered "expected" and needs to be fixed with some painful column family
splitting and messy application code?
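
For concreteness, the split could look roughly like the sketch below: one
table per logical unit, each with the same key. The table names
data_column1 ... data_column30, the value column, and the helper names are
hypothetical and only illustrate the shape of the workaround.

```python
def split_table_ddl(n_units=30):
    """One CREATE TABLE statement per logical unit (hypothetical table names)."""
    return [
        f"CREATE TABLE data_column{i} "
        "(id bigint, ts bigint, value blob, PRIMARY KEY (id, ts));"
        for i in range(1, n_units + 1)
    ]


def split_insert_statements(n_units=30):
    """The INSERTs needed to write one split sensor blob across all tables."""
    return [
        f"INSERT INTO data_column{i} (id, ts, value) VALUES (?, ?, ?);"
        for i in range(1, n_units + 1)
    ]
```

Writing one sensor blob then takes 30 INSERTs instead of one, which is the
extra application-side work the question is about.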

Thanks


--047d7bf0d2ecc5c293052bc25755--