From: Gianluca Borello <gianluca@sysdig.com>
Date: Sun, 14 Feb 2016 16:31:09 -0800
Subject: Re: Performance issues with "many" CQL columns
To: user@cassandra.apache.org

Thank you for your
reply. Your advice is definitely sound, although it still seems suboptimal to me because:

1) It requires N INSERT queries from the application code (where N is the number of columns).

2) It requires N SELECT queries from my application code (where N is the number of columns I need to read at any given time, which is determined at runtime). I can't even use the IN operator (e.g. WHERE column_number IN (1, 2, 3, ...)) because I am already using a non-EQ relation on the timestamp key, and Cassandra restricts me to only one non-EQ relation.

In summary, I can (and will) adapt my code to use a similar approach despite everything, but the goal of my message was mainly to understand why the JIRA issues I linked above are not full of dozens of "+1" comments. To me this really feels like a terrible performance issue that should be fixed by default (or, in the very worst case, clearly documented), even after understanding the motivation for reading all the columns in the CQL row.

Thanks

On Sun, Feb 14, 2016 at 3:05 PM, Jack Krupansky wrote:

> You could add the column number as an additional clustering key. And then
> you can actually use COMPACT STORAGE for even more efficient storage and
> access (assuming there is only a single non-PK data column, the blob
> value). You can then access (read or write) an individual column/blob or a
> slice of them.
>
> -- Jack Krupansky
>
> On Sun, Feb 14, 2016 at 5:22 PM, Gianluca Borello wrote:
>
>> Hi
>>
>> I've just painfully discovered a "little" detail in Cassandra: Cassandra
>> touches all columns on a CQL select (related issues:
>> https://issues.apache.org/jira/browse/CASSANDRA-6586,
>> https://issues.apache.org/jira/browse/CASSANDRA-6588,
>> https://issues.apache.org/jira/browse/CASSANDRA-7085).
>>
>> My data model is fairly simple: I have a bunch of "sensors" reporting a
>> blob of data (~10-100KB) periodically.
>> When reading, 99% of the time I'm interested in a subportion of that
>> blob of data across an arbitrary period of time. What I do is simply
>> split those blobs of data into about 30 logical units and write them in
>> a CQL table such as:
>>
>> create table data (
>>     id bigint,
>>     ts bigint,
>>     column1 blob,
>>     column2 blob,
>>     column3 blob,
>>     ...
>>     column29 blob,
>>     column30 blob,
>>     primary key (id, ts)
>> );
>>
>> id is a combination of the sensor id and a time bucket, in order to not
>> get the row too wide. Essentially, I thought this was a very legit data
>> model that helps me keep my application code very simple (because I can
>> work on a single table, I can write a split sensor blob in a single CQL
>> query, and I can read a subset of the columns very efficiently with one
>> single CQL query).
>>
>> What I didn't realize is that Cassandra seems to always process all the
>> columns of the CQL row, regardless of the fact that my query asks for
>> just one column, and this has a dramatic effect on the performance of my
>> reads.
>>
>> I wrote a simple isolated test case where I measure how long it takes to
>> read one *single* column in a CQL table composed of several columns (at
>> each iteration I add and populate 10 new columns), each filled with 1MB
>> blobs:
>>
>> 10 columns: 209 ms
>> 20 columns: 339 ms
>> 30 columns: 510 ms
>> 40 columns: 670 ms
>> 50 columns: 884 ms
>> 60 columns: 1056 ms
>> 70 columns: 1527 ms
>> 80 columns: 1503 ms
>> 90 columns: 1600 ms
>> 100 columns: 1792 ms
>>
>> In other words, even though the result set returned is exactly the same
>> across all these iterations, the response time increases linearly with
>> the size of the other columns, and this is really causing a lot of
>> problems in my application.
>>
>> By reading the JIRA issues, it seems like this is considered a very
>> minor optimization not worth the effort of fixing, so I'm asking: is my
>> use case really so anomalous that the horrible performance I'm
>> experiencing is to be considered "expected" and needs to be fixed with
>> some painful column family splitting and messy application code?
>>
>> Thanks
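For concreteness, the clustering-key layout suggested in the quoted reply might be sketched as follows (the table name `data_split` and its column names are illustrative, not the poster's actual schema):

```sql
-- Split table: one row per logical column, with the column number as an
-- extra clustering key. COMPACT STORAGE is possible because there is a
-- single non-primary-key column (the blob value).
create table data_split (
    id bigint,            -- sensor id + time bucket, as before
    ts bigint,
    column_number int,
    value blob,
    primary key (id, ts, column_number)
) with compact storage;

-- Read a single logical column at one timestamp...
select value from data_split
 where id = ? and ts = ? and column_number = 7;

-- ...or a contiguous slice of columns at that timestamp.
select value from data_split
 where id = ? and ts = ? and column_number >= 1 and column_number <= 5;
```

Note that this sketch still runs into the restriction raised above: combining a non-EQ relation on `ts` with a relation on `column_number` in one query is not allowed, so a time-range read either fetches all column numbers or falls back to N queries.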
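On the write side of objection (1), assuming a hypothetical split table `data_split (id, ts, column_number, value)` with `column_number` as a clustering key, the N inserts for one sensor blob all target the same partition, so they can at least be grouped into one unlogged batch and sent as a single request rather than N round trips:

```sql
-- All statements share the same id (partition key), so an unlogged batch
-- is applied to that partition as a single mutation.
begin unlogged batch
    insert into data_split (id, ts, column_number, value) values (?, ?, 1, ?);
    insert into data_split (id, ts, column_number, value) values (?, ?, 2, ?);
    -- ... one insert per logical column, up to column 30 ...
apply batch;
```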