Date: Tue, 2 Feb 2010 11:01:27 -0800
Subject: Re: get_slice() slow if more number of columns present in a SCF.
From: Nathan McCall
To: cassandra-user@incubator.apache.org

Thank you for the benchmarks. What version of Cassandra are you using? I
saw about an 80% performance improvement on single-node reads after using a
trunk build with the results from
https://issues.apache.org/jira/browse/CASSANDRA-688 (result caching) and
playing around with the configuration. I am not yet running this in
production though, so I cannot provide any real numbers.

That said, I have no intention of deploying a single node. I keep seeing
performance concerns from folks on small or single-node clusters. My
impression so far is that Cassandra may not be the right solution for
these types of deployments.

My main interest in Cassandra is the linear scalability of reads and
writes. From my own tests and some of the discussion on these lists, it
seems Cassandra can thrash around a lot when the number of nodes <= the
replication factor * 2, particularly if a node goes down. I understand
this is a design trade-off of sorts and I am fine with it. Any sort of
distributed, fault-tolerant system is well served by using lots of
commodity hardware.

What I found most valuable for my evaluation was to get a good test
together with some real data from our system and then add nodes, remove
nodes, break nodes, etc. and watch what happens. Once I finish with this,
it looks like I will have some solid numbers for capacity planning, to
figure out exactly how much hardware to purchase and when I will need to
add more.

Apologies to the original poster if that got a little long-winded, but
hopefully it will be useful information for folks.

Cheers,
-Nate

On Tue, Feb 2, 2010 at 7:27 AM, envio user wrote:
> All,
>
> Here are some tests [batch_insert() and get_slice()] I performed on cassandra.
>
> H/W: Single node, Quad Core(8 cores), 8GB RAM:
> Two separate physical disks, one for the commit log and another for the data.
>
> storage-conf.xml
> ================
> 0.4
> 256
> 128
> 0.2
> 1440
> 16
>
>
> Data Model:
>
> CompareSubcolumnsWith="UTF8Type" Name="Super1" />
>
> TEST1A
> ======
> /home/sun>python stress.py -n 100000 -t 100 -y super -u 1 -c 25 -r -o insert -i 10
> WARNING: multiprocessing not present, threading will be used.
>        Benchmark may not be accurate!
> total,interval_op_rate,avg_latency,elapsed_time
> 19039,1903,0.0532085509215,10
> 52052,3301,0.0302550313445,20
> 82274,3022,0.0330235137811,30
> 100000,1772,0.0337765234716,40
>
> TEST1B
> =====
> /home/sun>python stress.py -n 100000 -t 100 -y super -u 1 -c 25 -r -o read -i 10
> WARNING: multiprocessing not present, threading will be used.
>        Benchmark may not be accurate!
> total,interval_op_rate,avg_latency,elapsed_time
> 16472,1647,0.0615632034523,10
> 39375,2290,0.04384300123,20
> 65259,2588,0.0385473697268,30
> 91613,2635,0.0379411213277,40
> 100000,838,0.0331208069702,50
> /home/sun>
>
>
> **** I deleted all the data(all: commitlog,data..) and restarted cassandra.***
> I am ok with TEST1A and TEST1B. I want to populate the SCF with > 500
> columns and read 25 columns per key.
>
> TEST2A
> ======
> /home/sun>python stress.py -n 100000 -t 100 -y super -u 1 -c 600 -r -o insert -i 10
> WARNING: multiprocessing not present, threading will be used.
>        Benchmark may not be accurate!
> total,interval_op_rate,avg_latency,elapsed_time
> .............
> .............
> 84216,144,0.689481827031,570
> 85768,155,0.625061393859,580
> 87307,153,0.648041650953,590
> 88785,147,0.671928719674,600
> 90488,170,0.611753724284,610
> 91983,149,0.677673689896,620
> 93490,150,0.63891824366,630
> 95017,152,0.65472143182,640
> 96612,159,0.64355712789,650
> 98098,148,0.673311280851,660
> 99622,152,0.486848112166,670
> 100000,37,0.174115514629,680
>
> I understand nobody will write 600 columns at a time. I just need to
> populate the data, hence I did this test.
>
> [root@fc10mc1 ~]# ls -l /var/lib/cassandra/commitlog/
> total 373880
> -rw-r--r-- 1 root root 268462742 2010-02-03 02:00 CommitLog-1265141714717.log
> -rw-r--r-- 1 root root 114003919 2010-02-03 02:00 CommitLog-1265142593543.log
>
> [root@fc10mc1 ~]# ls -l /cassandra/lib/cassandra/data/Keyspace1/
> total 3024232
> -rw-r--r-- 1 root root 1508524822 2010-02-03 02:00 Super1-192-Data.db
> -rw-r--r-- 1 root root      92725 2010-02-03 02:00 Super1-192-Filter.db
> -rw-r--r-- 1 root root    2639957 2010-02-03 02:00 Super1-192-Index.db
> -rw-r--r-- 1 root root  100838971 2010-02-03 02:02 Super1-279-Data.db
> -rw-r--r-- 1 root root       8725 2010-02-03 02:02 Super1-279-Filter.db
> -rw-r--r-- 1 root root     176481 2010-02-03 02:02 Super1-279-Index.db
> -rw-r--r-- 1 root root 1478775337 2010-02-03 02:03 Super1-280-Data.db
> -rw-r--r-- 1 root root      90805 2010-02-03 02:03 Super1-280-Filter.db
> -rw-r--r-- 1 root root    2588072 2010-02-03 02:03 Super1-280-Index.db
> [root@fc10mc1 ~]#
>
> [root@fc10mc1 ~]# du -hs /cassandra/lib/cassandra/data/Keyspace1/
> 2.9G    /cassandra/lib/cassandra/data/Keyspace1/
>
>
> TEST2B
> ======
>
> /home/sun>python stress.py -n 100000 -t 100 -y super -u 1 -c 25 -r -o read -i 10
> WARNING: multiprocessing not present, threading will be used.
>        Benchmark may not be accurate!
> total,interval_op_rate,avg_latency,elapsed_time
> .................
> ................
>
> 66962,382,0.261044957001,180
> 70598,363,0.276139952824,190
> 74490,389,0.25678327989,200
> 78252,376,0.263047518976,210
> 82031,377,0.266485546846,220
> 86008,397,0.248498579411,230
> 89699,369,0.274926948857,240
> 93590,389,0.256867142883,250
> 97328,373,0.267352432985,260
> 100000,267,0.217604277555,270
>
> This test is more worrying for us. We can't even reach 1000 reads per
> second. Is there any limitation in Cassandra that keeps it from working
> well with a larger number of columns? Or am I doing something wrong here?
> Please let me know.
>
> Attached are the nodeprobe(tpstats), iostats, and vmstats taken for the tests.
>
>
> thanks in advance,
> -Aita
>
>
> some changes I made to stress.py to accommodate more columns.
>
> 157c156
> <         columns = [Column('A' + str(j), data, 0) for j in xrange(columns_per_key)]
> ---
> >         columns = [Column(chr(ord('A') + j), data, 0) for j in xrange(columns_per_key)]
> 159c158
> <             supers = [SuperColumn('A' + str(j), columns) for j in xrange(supers_per_key)]
> ---
> >             supers = [SuperColumn(chr(ord('A') + j), columns) for j in xrange(supers_per_key)]
> 187c186
> <                     parent = ColumnParent('Super1', 'A' + str(j))
> ---
> >                     parent = ColumnParent('Super1', chr(ord('A') + j))
>
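
A note on the stress.py diff quoted above: the message does not say why the
naming change was needed. One plausible reason, stated here as an assumption,
is that the original chr(ord('A') + j) scheme yields a single byte per name,
and Python 2's chr() only accepts values below 256, so it cannot name 600
columns. The sketch below is Python 3 and only imitates that behaviour with
bytes(); the helper names (single_byte_name, numbered_name, first_failure)
are made up for illustration and are not part of stress.py.

```python
# Illustrative only: compares the two column-naming schemes from the diff.
# Python 3 is used here, so bytes([n]) stands in for Python 2's chr(n);
# both reject values of 256 and above with a ValueError.

COLUMNS_PER_KEY = 600  # matches "-c 600" in the TEST2A command


def single_byte_name(j):
    """Original scheme: one byte starting at 'A' (like chr(ord('A') + j))."""
    return bytes([ord("A") + j])


def numbered_name(j):
    """Modified scheme from the diff: 'A' followed by the decimal index."""
    return ("A" + str(j)).encode("ascii")


def first_failure(namer, count):
    """Return the first index for which namer raises ValueError, else None."""
    for j in range(count):
        try:
            namer(j)
        except ValueError:
            return j
    return None


if __name__ == "__main__":
    print("single-byte scheme first fails at j =",
          first_failure(single_byte_name, COLUMNS_PER_KEY))
    print("numbered scheme first fails at j =",
          first_failure(numbered_name, COLUMNS_PER_KEY))
    unique = {numbered_name(j) for j in range(COLUMNS_PER_KEY)}
    print("distinct numbered names:", len(unique))
```

Under these assumptions the single-byte scheme runs out of values at
j = 191 (65 + 191 = 256), while 'A' + str(j) gives 600 distinct ASCII names.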