cassandra-user mailing list archives

From Sylvain Lebresne <sylv...@datastax.com>
Subject Re: low performance inserting
Date Tue, 03 May 2011 14:07:20 GMT
There are probably a fair number of things you'd have to do to improve write
performance on the Cassandra side (starting with using multiple threads to do
the insertion), but the first thing is to compare things that are at least
mildly comparable. If you do inserts in Cassandra, you should do inserts in
MySQL too, not "load data infile" (which really is just a bulk loading
utility). As stated at
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html:
"When loading a table from a text file, use LOAD DATA INFILE. This is
usually 20 times faster than using INSERT statements."
Taken at face value, that ratio would put 100,000 plain INSERTs at around
40 seconds (20 x 2 seconds), which is the same order of magnitude as your
27 seconds with Cassandra.

--
Sylvain

On Tue, May 3, 2011 at 12:30 PM, charles THIBAULT
<charl.thibault@gmail.com> wrote:
> Hello everybody,
>
> First, sorry in advance for my English!
>
> I'm getting started with Cassandra on a 5-node cluster, inserting data
> with the pycassa API.
>
> I've read everywhere on the internet that Cassandra's write performance is
> better than MySQL's because writes are only appended to the commit log
> files.
>
> When I try to insert 100,000 rows with 10 columns per row using batch
> insert, it takes 27 seconds.
> With MySQL (load data infile) it takes only 2 seconds (using indexes).
>
> Here is my configuration:
>
> cassandra version: 0.7.5
> nodes : 192.168.1.210, 192.168.1.211, 192.168.1.212, 192.168.1.213,
> 192.168.1.214
> seed: 192.168.1.210
>
> My script:
> *************************************************************************************************************
> #!/usr/bin/env python
>
> import pycassa
> import time
> import random
> from cassandra import ttypes
>
> pool = pycassa.connect('test', ['192.168.1.210:9160'])
> cf = pycassa.ColumnFamily(pool, 'test')
> b = cf.batch(queue_size=50,
>              write_consistency_level=ttypes.ConsistencyLevel.ANY)
>
> tps1 = time.time()
> for i in range(100000):
>     columns = dict()
>     for j in range(10):
>         columns[str(j)] = str(random.randint(0,100))
>     b.insert(str(i), columns)
> b.send()
> tps2 = time.time()
>
>
> print("execution time: " + str(tps2 - tps1) + " seconds")
> *************************************************************************************************************
>
> What am I doing wrong?
>
