From: Gianluca Borello
Date: Sun, 14 Feb 2016 14:22:20 -0800
Subject: Performance issues with "many" CQL columns
To: user@cassandra.apache.org

Hi

I've just painfully discovered a "little" detail in Cassandra: Cassandra touches all columns on a CQL select (related issues: https://issues.apache.org/jira/browse/CASSANDRA-6586, https://issues.apache.org/jira/browse/CASSANDRA-6588, https://issues.apache.org/jira/browse/CASSANDRA-7085).

My data model is fairly simple: I have a bunch of "sensors" reporting a blob of data (~10-100KB) periodically. When reading, 99% of the time I'm interested in a subportion of that blob of data across an arbitrary period of time. What I do is simply split those blobs of data into about 30 logical units and write them to a CQL table such as:

create table data (
    id bigint,
    ts bigint,
    column1 blob,
    column2 blob,
    column3 blob,
    ...
    column29 blob,
    column30 blob,
    primary key (id, ts)
);

id is a combination of the sensor id and a time bucket, in order to keep the row from getting too wide. Essentially, I thought this was a very legit data model that helps me keep my application code very simple (because I can work on a single table, I can write a split sensor blob in a single CQL query, and I can read a subset of the columns very efficiently with a single CQL query).
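
To make the data model above concrete, here is a minimal sketch of how such a bucketed id and a single-column read could be built. The post does not say how the sensor id and time bucket are combined, so the bit layout, the one-day bucket width, the assumption that ts is in seconds, and the helper names partition_id and select_columns are all hypothetical.

```python
# Hypothetical sketch: none of these names or constants come from the original post.
BUCKET_SECONDS = 24 * 60 * 60  # assumed bucket width: one day
BUCKET_BITS = 24               # assumed number of low bits reserved for the bucket


def partition_id(sensor_id: int, ts: int) -> int:
    """Pack a sensor id and the time bucket containing ts (assumed seconds) into one integer.

    The caller must keep sensor_id small enough for the result to fit a CQL bigint
    (signed 64-bit); this sketch does not check that.
    """
    bucket = ts // BUCKET_SECONDS
    if bucket >= 1 << BUCKET_BITS:
        raise ValueError("timestamp outside the range covered by BUCKET_BITS")
    return (sensor_id << BUCKET_BITS) | bucket


def select_columns(columns: list[str]) -> str:
    """Build a CQL SELECT with bind markers that names only the wanted blob columns."""
    return (
        f"SELECT ts, {', '.join(columns)} FROM data "
        "WHERE id = ? AND ts >= ? AND ts < ?"
    )
```

With this layout, all samples of one sensor within one day share a partition, and a read of a single logical unit names only that column, e.g. select_columns(["column3"]).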

What I didn't realize is that Cassandra seems to always process all the columns of the CQL row, regardless of the fact that my query asks for just one column, and this has a dramatic effect on the performance of my reads.

I wrote a simple isolated test case where I test how long it takes to read one *single* column in a CQL table composed of several columns (at each iteration I add and populate 10 new columns), each filled with 1MB blobs:

10 columns: 209 ms
20 columns: 339 ms
30 columns: 510 ms
40 columns: 670 ms
50 columns: 884 ms
60 columns: 1056 ms
70 columns: 1527 ms
80 columns: 1503 ms
90 columns: 1600 ms
100 columns: 1792 ms

In other words, even if the result set returned is exactly the same across all these iterations, the response time increases linearly with the size of the other columns, and this is really causing a lot of problems in my application.
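
As a rough check of the "increases linearly" reading, the timings listed above can be run through an ordinary least-squares fit. This is only a sketch over the ten posted data points; the function name least_squares is an illustrative choice, and the fit says nothing about where inside Cassandra the time is spent.

```python
# Timings copied from the list above: (number of 1MB columns, read latency in ms).
TIMINGS = [
    (10, 209), (20, 339), (30, 510), (40, 670), (50, 884),
    (60, 1056), (70, 1527), (80, 1503), (90, 1600), (100, 1792),
]


def least_squares(points):
    """Return (slope, intercept) of the ordinary least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    slope, intercept = least_squares(TIMINGS)
    print(f"{slope:.1f} ms per extra column, intercept {intercept:.1f} ms")
```

On these numbers the slope comes out at roughly 18.7 ms per additional 1MB column, even though the query only ever asks for one column.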

By reading the JIRA issues, it seems like this is considered a very minor optimization not worth the effort of fixing, so I'm asking: is my use case really so anomalous that the horrible performance I'm experiencing is to be considered "expected" and needs to be fixed with some painful column family splitting and messy application code?
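
For reference, the "column family splitting" alternative could look something like the sketch below: one table per logical unit, so that a read of one unit never touches the other blobs. The table naming scheme (data_<column>), the value column, and both helper names are hypothetical, not something taken from the JIRA issues.

```python
# Hypothetical sketch of the splitting workaround; names are illustrative only.
def split_table_ddl(column: str) -> str:
    """CQL for a per-unit table holding a single blob per (id, ts)."""
    return (
        f"CREATE TABLE data_{column} (\n"
        "    id bigint,\n"
        "    ts bigint,\n"
        "    value blob,\n"
        "    PRIMARY KEY (id, ts)\n"
        ");"
    )


def split_insert_statements(columns: list[str]) -> list[str]:
    """One INSERT per logical unit, replacing the single INSERT of the wide table."""
    return [
        f"INSERT INTO data_{column} (id, ts, value) VALUES (?, ?, ?)"
        for column in columns
    ]
```

Writing one split sensor blob then takes 30 statements instead of one, which is the extra application complexity the question refers to.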

Thanks