From: Tyler Hobbs
To: user@cassandra.apache.org
Date: Thu, 17 Jan 2013 18:18:44 -0600
Subject: Re: Cassandra Performance Benchmarking.

ConnectionPools and ColumnFamilies are thread-safe in pycassa, and it's best to share them across multiple threads. Of course, when you do that, make sure to make the ConnectionPool large enough to support all of the threads making queries concurrently. I'm also not sure if you're just omitting this, but pycassa's ConnectionPool will only open connections to servers you explicitly include in server_list; there's no autodiscovery of other nodes going on.
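
As a rough sketch of the shared-pool setup described above (not a tested benchmark): one ConnectionPool and one ColumnFamily are built once and handed to every thread, with pool_size set to the thread count. The helper names make_shared_cf and run_threads are made up for illustration, and dealing the keys out across threads is my own choice; the original script had every thread read the whole file. The pycassa import is kept local so the threading part can be tried with any object that has a get(key) method.

```python
import threading
import time


def make_shared_cf(servers, n_threads):
    """Build one ColumnFamily that every thread shares.

    Needs pycassa and a reachable cluster, so the import stays local.
    """
    import pycassa
    pool = pycassa.ConnectionPool(
        'Blast',
        server_list=servers,   # list every node; there is no autodiscovery
        pool_size=n_threads,   # one connection per concurrent thread
    )
    return pycassa.ColumnFamily(pool, 'Blast_NR')


def run_threads(cf, keys, n_threads):
    """Deal keys across n_threads, call cf.get on each, return (reads, seconds)."""
    counts = [0] * n_threads

    def worker(i):
        # Thread i takes every n_threads-th key, starting at offset i.
        for key in keys[i::n_threads]:
            cf.get(key)
            counts[i] += 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(counts), time.time() - start
```

Usage would be along the lines of cf = make_shared_cf(['node1:9160', 'node2:9160'], 32) followed by run_threads(cf, keys, 32), where the node addresses are placeholders for your own server list.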

Depending on your network latency, you'll top out on python performance with a fairly low number of threads due to the GIL. It's best to use multiple processes if you really want to benchmark something.
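
A minimal sketch of the multi-process variant, assuming the standard library's multiprocessing module and one pycassa pool per worker process (each worker opens its own connections rather than inheriting them from the parent). The function names chunk_keys, pycassa_cf, bench_chunk and run_processes are invented for this example, and the keyspace and column family names are simply the ones from the quoted script.

```python
import multiprocessing
import time


def chunk_keys(keys, n_chunks):
    """Deal keys round-robin into n_chunks lists."""
    return [keys[i::n_chunks] for i in range(n_chunks)]


def pycassa_cf(servers):
    """Open a pool inside the calling process (needs pycassa and a cluster)."""
    import pycassa
    pool = pycassa.ConnectionPool('Blast', server_list=servers)
    return pycassa.ColumnFamily(pool, 'Blast_NR')


def bench_chunk(args):
    """Worker: build a ColumnFamily, read every key, return (reads, seconds)."""
    make_cf, servers, keys = args
    cf = make_cf(servers)
    start = time.time()
    for key in keys:
        cf.get(key)
    return len(keys), time.time() - start


def run_processes(make_cf, servers, keys, n_procs):
    """Run bench_chunk in n_procs processes; return total reads per second."""
    jobs = [(make_cf, servers, chunk) for chunk in chunk_keys(keys, n_procs)]
    start = time.time()
    procs = multiprocessing.Pool(n_procs)
    try:
        results = procs.map(bench_chunk, jobs)
    finally:
        procs.close()
        procs.join()
    wall = time.time() - start
    return sum(n for n, _ in results) / wall
```

Something like run_processes(pycassa_cf, servers, keys, 8) would then report an aggregate rate; the process count is a guess to tune against the client machine's cores.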


On Thu, Jan 17, 2013 at 6:05 PM, Pradeep Kumar Mantha <pradeepm66@gmail.com> wrote:
Hi,

Thanks. I would like to benchmark cassandra with our application so
that we understand the details of how the actual benchmarking is done.
Not sure how easy it would be to integrate YCSB with our application.

So, I am trying different client interfaces to Cassandra.

I found, for a Cassandra cluster of 12 data nodes and 1 client node running 32
threads (each issuing X queries):

cassandra-cli took 133 seconds
pycassa took 521 seconds.

Here is the pycassa code that each thread runs:

import pycassa

def start_cassandra_client(Threadname):
    # Each call opens its own pool against the one listed server.
    pool = pycassa.ConnectionPool('Blast', server_list=['xxx.xx.xx.xx'])
    cf = pycassa.ColumnFamily(pool, 'Blast_NR')
    # One row key per line; the file is closed when the block exits.
    with open("pycassa_100%_query") as inp_file:
        for key in inp_file:
            key = key.strip()
            cf.get(key)

Do Java clients like Hector/Astyanax help here? I am more
comfortable with Python than Java and our existing application is also
in Python.

thanks
pradeep


On Thu, Jan 17, 2013 at 2:08 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
> Wow, you managed to do a load test through the cassandra-cli. There should be
> a merit badge for that.
>
> You should use the built-in stress tool or YCSB.
>
> The CLI has to do much more string conversion than a normal client would, and
> it is not built for performance. You will definitely get better numbers
> through other means.
>
> On Thu, Jan 17, 2013 at 2:10 PM, Pradeep Kumar Mantha <pradeepm66@gmail.com>
> wrote:
>>
>> Hi,
>>
>> I am trying to maximize the number of read queries executed per second.
>>
>> Here is my cluster configuration.
>>
>> Replication - Default
>> 12 Data Nodes.
>> 16 Client Nodes - used for querying.
>>
>> Each client node executes 32 threads - each thread executes 76896 read
>> queries using the cassandra-cli tool,
>> i.e. all the read queries are stored in a file and that file is
>> given to the cassandra-cli tool (using the -f option), which is executed by a
>> thread.
>> So the total number of queries for 16 client nodes is 16 * 32 * 76896.
>>
>> The read queries on each client node are submitted at the same time. The
>> time taken for 16 * 32 * 76896 read queries is nearly 742 seconds -
>> which is nearly 53k transactions/second.
>>
>> I would like to know if there is any other way/tool through which I
>> can improve the number of transactions/second.
>> Is the performance affected by the cassandra-cli tool?
>>
>> thanks
>> pradeep
>
>
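
For reference, the throughput figure quoted in the original post can be checked with a couple of lines. This only uses the numbers given in the thread (16 client nodes, 32 threads each, 76896 queries per thread, about 742 seconds); the variable names are mine.

```python
# Numbers taken from the quoted message.
client_nodes = 16
threads_per_node = 32
queries_per_thread = 76896
elapsed_seconds = 742

total_queries = client_nodes * threads_per_node * queries_per_thread
reads_per_second = total_queries / float(elapsed_seconds)

print(total_queries)            # 39370752
print(round(reads_per_second))  # about 53060, i.e. the "nearly 53k" quoted
```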



--
Tyler Hobbs
DataStax