cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gianluca Borello <gianl...@sysdig.com>
Subject Re: Performance issues with "many" CQL columns
Date Mon, 15 Feb 2016 00:31:09 GMT
Thank you for your reply.

Your advice is definitely sound, although it still seems suboptimal to me
because:

1) It requires N INSERT queries from the application code (where N is the
number of columns)

2) It requires N SELECT queries from my application code (where N is the
number of columns I need to read at any given time, which is determined at
runtime). I can't even use the IN operator (e.g. WHERE column_number IN (1,
2, 3, ...)) because I am already using a non-EQ relation on the timestamp
key and Cassandra restricts me to only one non-EQ relation.

In summary, I can (and will) adapt my code to use a similar approach
despite everything, but the goal of my message was mainly to understand why
the jira issues I linked above are not full of dozens of "+1" comments.

To me this really feels like a terrible performance issue that should be
fixed by default (or in the very worst case clearly documented), even after
understanding the motivation for reading all the columns in the CQL row.

Thanks

On Sun, Feb 14, 2016 at 3:05 PM, Jack Krupansky <jack.krupansky@gmail.com>
wrote:

> You could add the column number as an additional clustering key. And then
> you can actually use COMPACT STORAGE for even more efficient storage and
> access (assuming there is only  a single non-PK data column, the blob
> value.) You can then access (read or write) an individual column/blob or a
> slice of them.
>
> -- Jack Krupansky
>
> On Sun, Feb 14, 2016 at 5:22 PM, Gianluca Borello <gianluca@sysdig.com>
> wrote:
>
>> Hi
>>
>> I've just painfully discovered a "little" detail in Cassandra: Cassandra
>> touches all columns on a CQL select (related issues
>> https://issues.apache.org/jira/browse/CASSANDRA-6586,
>> https://issues.apache.org/jira/browse/CASSANDRA-6588,
>> https://issues.apache.org/jira/browse/CASSANDRA-7085).
>>
>> My data model is fairly simple: I have a bunch of "sensors" reporting a
>> blob of data (~10-100KB) periodically. When reading, 99% of the times I'm
>> interested in a subportion of that blob of data across an arbitrary period
>> of time. What I do is simply splitting those blobs of data in about 30
>> logical units and write them in a CQL table such as:
>>
>> create table data (
>> id bigint,
>> ts bigint,
>> column1 blob,
>> column2 blob,
>> column3 blob,
>> ...
>> column29 blob,
>> column30 blob
>> primary key (id, ts)
>>
>> id is a combination of the sensor id and a time bucket, in order to not
>> get the row too wide. Essentially, I thought this was a very legit data
>> model that helps me keep my application code very simple (because I can
>> work on a single table, I can write a split sensor blob in a single CQL
>> query and I can read a subset of the columns very efficiently with one
>> single CQL query).
>>
>> What I didn't realize is that Cassandra seems to always process all the
>> columns of the CQL row, regardless of the fact that my query asks just one
>> column, and this has dramatic effect on the performance of my reads.
>>
>> I wrote a simple isolated test case where I test how long it takes to
>> read one *single* column in a CQL table composed of several columns (at
>> each iteration I add and populate 10 new columns), each filled with 1MB
>> blobs:
>>
>> 10 columns: 209 ms
>> 20 columns: 339 ms
>> 30 columns: 510 ms
>> 40 columns: 670 ms
>> 50 columns: 884 ms
>> 60 columns: 1056 ms
>> 70 columns: 1527 ms
>> 80 columns: 1503 ms
>> 90 columns: 1600 ms
>> 100 columns: 1792 ms
>>
>> In other words, even if the result set returned is exactly the same
>> across all these iteration, the response time increases linearly with the
>> size of the other columns, and this is really causing a lot of problems in
>> my application.
>>
>> By reading the JIRA issues, it seems like this is considered a very minor
>> optimization not worth the effort of fixing, so I'm asking: is my use case
>> really so anomalous that the horrible performance that I'm experiencing are
>> to be considered "expected" and need to be fixed with some painful column
>> family splitting and messy application code?
>>
>> Thanks
>>
>
>

Mime
View raw message