From: André Cruz
To: user@cassandra.apache.org
Subject: Re: Query advice to prevent node overload
Date: Mon, 17 Sep 2012 10:06:30 +0100

On Sep 17, 2012, at 3:04 AM, aaron morton <aaron@thelastpickle.com> wrote:

>> I have a schema that represents a filesystem and one example of a
>> Super CF is:
> This may help with some ideas:
> http://www.datastax.com/dev/blog/cassandra-file-system-design
>
> In general we advise avoiding Super Columns if possible. They are
> often slower, and the sub columns are not indexed, meaning all the
> sub columns have to be read into memory.
>
>> So if I set column_count = 10000, as I have now, but fetch 1000
>> dirs (rows) and each one happens to have 10000 files (columns) the
>> dataset is 1000x10000.
> This is the way the query works internally. Multiget is simply a
> collection of independent gets.
>
>> The multiget() is more efficient, but I'm having trouble trying to
>> limit the size of the data returned in order to not crash the
>> cassandra node.
> Often less is more. I would only ask for a few tens of rows at a
> time, or try to limit the size of the returned query to a few MB.
> Otherwise a lot of data gets dragged through cassandra, the network
> and finally Python.
>
> You may want to consider a CF like the inode CF in the article
> above, where the parent dir is a column with a secondary index.

Thanks Aaron! I will take your points into consideration.

Best regards,
André
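
For illustration, the advice above (ask for a few tens of rows at a time instead of 1000 in a single multiget) could be sketched roughly as follows. This is only a minimal sketch under assumptions: `fetch` is a hypothetical stand-in for whatever client call takes a list of row keys and returns a dict of rows (for example a small wrapper around the client's multiget() with column_count fixed), and the names `chunked`, `batched_multiget` and `worst_case_columns` are invented here, not part of any Cassandra client.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# A row is modelled as a plain dict of column name -> value (assumption).
Row = Dict[str, bytes]


def chunked(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` keys."""
    if size < 1:
        raise ValueError("size must be >= 1")
    batch: List[str] = []
    for key in keys:
        batch.append(key)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def batched_multiget(
    fetch: Callable[[List[str]], Dict[str, Row]],
    keys: Iterable[str],
    rows_per_call: int = 20,
) -> Iterator[Tuple[str, Row]]:
    """Call `fetch` on small key batches and stream the rows back one by one.

    `fetch` is hypothetical: any callable that maps a list of row keys
    to a dict of {key: columns}.
    """
    for batch in chunked(keys, rows_per_call):
        for key, columns in fetch(batch).items():
            yield key, columns


def worst_case_columns(row_count: int, column_count: int) -> int:
    """Upper bound on columns returned: every row may hit column_count."""
    return row_count * column_count
```

With column_count = 10000, a single call for 1000 rows can return up to worst_case_columns(1000, 10000) = 10,000,000 columns, while a batch of 20 rows is bounded by 200,000. Because the rows are yielded as each batch arrives, the caller never has to hold the whole result set in Python at once.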