From: Edward Capriolo
To: user@cassandra.apache.org
Date: Sun, 22 Jan 2012 09:51:06 -0500
Subject: Re: Cassandra x MySQL Sharded - Insert Comparison

In some sense, one-for-one performance "almost" does not matter. Though I
bet you can get Cassandra to do better (I remember old-school YCSB white
paper benchmarks against a sharded MySQL).

One of the main bullet points of Cassandra is that if you want to grow from
4 nodes, to 8 nodes, to 14 nodes, and so on, Cassandra is elastic and
supports adding and removing nodes online. A do-it-yourself "hash mod this"
algorithm really has no upgrade path.

Edward

On Sun, Jan 22, 2012 at 9:26 AM, Chris Gerken wrote:

> Howdy Gustavo,
>
> One thing that jumped out at me is your having put two Cassandra images
> on the same box. There may be enough CPU and memory for the two images
> combined, but you may be seeing some other resource not being shared so
> nicely: network card bandwidth, for example.
>
> More generally, the real question is what the bottleneck is (for both
> DBs, actually). Start with Cassandra running in that configuration and
> start with one client thread sending one request a second. Look at the
> CPU, network and memory metrics for all boxes (including the client).
> Nothing should be even close to maxing out at that throughput. Now
> incrementally increase one of the test parameters (number of clients or
> number of inserts per second) just a bit (say from one transaction to 5)
> and note the above metrics. Keep slowly increasing the test parameters,
> one at a time, until one of the metrics maxes out. That's the bottleneck
> you're wondering about. Fix that and the DB (be it Cassandra or MySQL)
> will move ahead of the other performance-wise. Turn your attention to
> the other DB and repeat.
>
> - Chris Gerken
>
> On Jan 22, 2012, at 7:10 AM, Gustavo Gustavo wrote:
>
> Hello,
>
> I've set up a testing environment for Cassandra and MySQL, to compare
> both regarding *performance only*. And I must admit that I was expecting
> Cassandra to beat MySQL. But I've not seen this happening up to now.
>
> My application/use case is INSERT intensive, since I'm not updating
> anything, just inserting all the time.
>
> To compare both I created virtual machines with Ubuntu 11.10, and
> installed the latest versions of each datastore. Each VM has 1GB of RAM.
> I've used VMs as a way to give both datastores an equal sandbox.
>
> MySQL is set up to work as sharded, with 2 databases, which means that
> records are inserted into a specific instance based on key % 2. The
> engine is MyISAM (InnoDB was really slow and not really needed for my
> case). There's a compound primary key (integer and datetime columns) in
> this test table. Let's name the "nodes" MySQL1 and MySQL2.
>
> Cassandra is set up to work with 4 nodes, with keys (tokens) set up to
> distribute records evenly across the 4 nodes (nodetool ring reports 25%
> for each node), replication factor 1 and RandomPartitioner; the other
> configs are left at their defaults. Let's name the nodes Cassandra1,
> Cassandra2, Cassandra3 and Cassandra4.
>
> I'm using 2 physical machines (Windows7) to host the 4 (Cassandra) or 2
> (MySQL) virtual machines, this way:
> Machine1: MySQL1, Cassandra1, Cassandra3
> Machine2: MySQL2, Cassandra2, Cassandra4
> The machines have enough CPU and RAM to host the Cassandra cluster or
> the MySQL "cluster", one at a time.
>
> The client test application is running on a third physical machine, with
> 8 threads doing inserts. The test application is written in C# (Windows7)
> using the Aquiles high-level client.
>
> My use case is a vehicle tracking system. So, let's suppose that, from
> minute to minute, the vehicle sends its position together with some other
> GPS data and vehicle status information. The columns in my Cassandra
> cluster are just the DateTime (long value) of a position for a specific
> vehicle, and the value is all the other data serialized to binary format.
> Therefore, my CF really grows in number of columns. So all data is
> inserted into only one CF/table, named Positions. The key in Cassandra is
> the VehicleID, and in MySQL it is VehicleID + PositionDateTime (MySQL
> creates an index on this automatically). It is important to note that
> MySQL threw tons of connection exceptions; even so, each insert was
> retried until it got through.
>
> My test case was to insert 1k positions per day for 1k vehicles over 10
> days, which gives 10,000,000 inserts.
>
> The final throughput that my application had for this scenario was:
>
> Cassandra x 4
> 2012-01-21 11:45:38,044 #6 [Logger.Log] INFO - >> Inserted 10000 positions for 1000 vehicles (10000000 inserts):
> 2012-01-21 11:45:38,082 #6 [Logger.Log] INFO - >> Total Time: 2:37:03,359
> 2012-01-21 11:45:38,085 #6 [Logger.Log] INFO - >> Throughput: 1061 inserts/s
>
> And for MySQL x 2
> 2012-01-21 14:26:25,197 #6 [Logger.Log] INFO - >> Inserted 10000 positions for 1000 vehicles (10000000 inserts):
> 2012-01-21 14:26:25,250 #6 [Logger.Log] INFO - >> Total Time: 2:06:25,914
> 2012-01-21 14:26:25,263 #6 [Logger.Log] INFO - >> Throughput: 1318 inserts/s
>
> Is there something that I'm missing here? Is this expected? Or is the
> problem somewhere else, and that's hard to say from this description?
>
> Cheers,
> Gustavo
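
As a cross-check on the figures Gustavo reports, the throughput can be recomputed from the logged total times. This is only a sketch: the helper names (`parse_elapsed`, `inserts_per_second`) are made up for illustration, and it assumes the logged durations use an `H:MM:SS,mmm` layout with a comma before the milliseconds.

```python
from datetime import timedelta

# 10000 positions x 1000 vehicles, as stated in the log lines.
TOTAL_INSERTS = 10_000_000


def parse_elapsed(text: str) -> timedelta:
    """Parse an 'H:MM:SS,mmm' duration (assumed layout of the logged total time)."""
    clock, millis = text.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds,
                     milliseconds=int(millis))


def inserts_per_second(elapsed: str, total: int = TOTAL_INSERTS) -> float:
    """Average insert rate over the whole run."""
    return total / parse_elapsed(elapsed).total_seconds()


cassandra = inserts_per_second("2:37:03,359")
mysql = inserts_per_second("2:06:25,914")

print(f"Cassandra x 4: {cassandra:.0f} inserts/s")
print(f"MySQL x 2:     {mysql:.0f} inserts/s")
print(f"MySQL / Cassandra: {mysql / cassandra:.2f}")
```

Both logged throughput values agree with this calculation, so the gap comes from the run times themselves rather than from a reporting error. An average over the whole run says nothing about where the bottleneck is, which is the question Chris raises.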

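
Edward's remark that a do-it-yourself hash-mod scheme has no upgrade path can be illustrated by counting how many keys would land on a different shard when the shard count changes. This is a rough sketch, not anyone's actual setup: the function name `moved_fraction` and the vehicle IDs 0..999 are assumptions for illustration; only the `key % 2` placement rule comes from the thread.

```python
def moved_fraction(old_shards: int, new_shards: int, keys: range) -> float:
    """Fraction of keys whose shard changes when 'key % N' switches N."""
    moved = sum(1 for key in keys if key % old_shards != key % new_shards)
    return moved / len(keys)


# Hypothetical vehicle IDs; the thread only says there are 1000 vehicles.
vehicle_ids = range(1000)

for old, new in [(2, 3), (2, 4), (4, 8)]:
    fraction = moved_fraction(old, new, vehicle_ids)
    print(f"{old} -> {new} shards: {fraction:.1%} of keys change shard")
```

With this placement rule, every key that changes shard has its existing rows in the wrong database after the change, so growing the "cluster" means migrating that data by hand while the application keeps writing.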