cassandra-user mailing list archives

From Richard grossman <>
Subject Re: Time to insert bulk data is very high comparing to database
Date Sun, 08 Nov 2009 17:47:45 GMT

On Sun, Nov 8, 2009 at 3:56 PM, Jonathan Ellis <> wrote:

> - You’ll easily double performance by setting the log level from DEBUG
> to INFO (unclear if you actually did this, so mentioning it for
> completeness)
No problem, I've checked and everything is set to INFO.

> - 0.4.1 has bad default GC options. the defaults will be fixed for
> 0.4.2 and 0.5, but it’s easy to tweak for 0.4.1:
Sorry, I can't find the post that talks about that, and I can't open this
link on my Mac.

> - it doesn't look like you're doing parallel inserts.  you should have
> at least a few dozen to a few hundred threads if you want to measure
> throughput rather than just latency.  run the client on a machine that
> is not running cassandra, since it can also use a decent amount of
> CPU.
By parallel, do you mean writing code that runs the inserts in threads
instead of one by one?
If so, is the Thrift API thread-safe? How do you manage opening and closing
the connection? For example, does each thread open one and close it at the
end?
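
To make the question concrete, here is a minimal sketch of what I understand
by parallel inserts. It assumes the generated Thrift client must not be shared
between threads, so each worker opens its own connection and closes it when its
share of the rows is done. RowWriter and RowWriterFactory are hypothetical
stand-ins for "one open Thrift connection" and are not part of the Cassandra
API; a real factory would open a socket and wrap it in a Cassandra.Client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelInsertSketch {

    /** Hypothetical stand-in for one open Thrift connection (not a Cassandra type). */
    interface RowWriter extends AutoCloseable {
        void insert(String key) throws Exception;

        @Override
        void close() throws Exception;
    }

    /** Hypothetical factory; a real one would open a socket and build a client on it. */
    interface RowWriterFactory {
        RowWriter open() throws Exception;
    }

    /**
     * Splits the keys across `threads` workers. Each worker opens its own
     * connection, writes every `threads`-th key starting at its own offset,
     * and closes the connection at the end, so no connection is shared.
     * Returns the total number of rows written.
     */
    static int insertAll(List<String> keys, int threads, RowWriterFactory factory)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int offset = t;
                results.add(pool.submit(() -> {
                    int written = 0;
                    // One connection per worker, closed automatically when done.
                    try (RowWriter writer = factory.open()) {
                        for (int i = offset; i < keys.size(); i += threads) {
                            writer.insert(keys.get(i));
                            written++;
                        }
                    }
                    return written;
                }));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // rethrows any failure from a worker
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

With something like insertAll(keys, 50, factory), dividing the number of rows
by the elapsed wall-clock time would give throughput rather than per-call
latency.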

>  - using batch_insert will be much faster than multiple single-column
> inserts to the same row
I've modified the code like this:
    public void insertChannelShow(String showId, String channelId, String airDate,
            String duration, String title, String parentShowId, String genre,
            String price, String subtitle) throws Exception {
        Calendar calendar = Calendar.getInstance();
        Date air = dateFormat.parse(airDate);

        // Row key: current time in milliseconds, then the show and channel ids
        String key = String.valueOf(calendar.getTimeInMillis()) + ":" + showId + ":" + channelId;

        long timestamp = System.currentTimeMillis();

        // All columns of the row, grouped under their column family
        Map<String, List<ColumnOrSuperColumn>> insertDataMap =
                new HashMap<String, List<ColumnOrSuperColumn>>();
        List<ColumnOrSuperColumn> rowData = new ArrayList<ColumnOrSuperColumn>();

        rowData.add(new ColumnOrSuperColumn(
                new Column("duration".getBytes("UTF-8"), duration.getBytes("UTF-8"), timestamp), null));
        rowData.add(new ColumnOrSuperColumn(
                new Column("title".getBytes("UTF-8"), title.getBytes("UTF-8"), timestamp), null));
        rowData.add(new ColumnOrSuperColumn(
                new Column("parentShowId".getBytes("UTF-8"), parentShowId.getBytes("UTF-8"), timestamp), null));
        rowData.add(new ColumnOrSuperColumn(
                new Column("genre".getBytes("UTF-8"), genre.getBytes("UTF-8"), timestamp), null));
        rowData.add(new ColumnOrSuperColumn(
                new Column("price".getBytes("UTF-8"), price.getBytes("UTF-8"), timestamp), null));
        rowData.add(new ColumnOrSuperColumn(
                new Column("subtitle".getBytes("UTF-8"), subtitle.getBytes("UTF-8"), timestamp), null));

        insertDataMap.put("channelShow", rowData);

        cassandraClient.batch_insert("Keyspace1", key, insertDataMap,

        insertDataMap = null;
        rowData = null;
    }

Is this what you had in mind?

Anyway, I've launched a new small Amazon instance, separate from the ones
running Cassandra, to run the inserts, and pointed it at the IP of one of the
Cassandra servers. It didn't improve anything. The client machine is at 1%
CPU and the server machines are at 1% CPU.

The problem appears when the data is distributed between the 2 Cassandra
servers. As long as the data goes to the commitlog of the first server,
everything is fine at ~2000 rows/second. But when the data goes to the second
server, throughput falls sharply to ~200 rows/second.

I've read that I can check latency with JMX. That's fine, but I can't manage
to connect to the JMX agent on Amazon: the parameters are OK but nothing
helps, and jconsole on my side refuses to connect. Is there something else I
can check?
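
One thing that could be tried in place of jconsole is a small standalone probe
using the standard javax.management.remote API. The sketch below only checks
whether a JMX connection can be established and lists the MBeans the server
exposes. The host and port are command-line arguments and are assumptions to be
filled in from the server's JMX settings; the class name JmxProbeSketch is made
up for illustration.

```java
import java.net.MalformedURLException;
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxProbeSketch {

    /** Builds the usual RMI-based JMX service URL for a given host and port. */
    static JMXServiceURL serviceUrl(String host, int port) throws MalformedURLException {
        return new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
    }

    public static void main(String[] args) throws Exception {
        String host = args[0];                 // server address (assumption: reachable from here)
        int port = Integer.parseInt(args[1]);  // JMX port configured on the server

        // connect() throws if the agent cannot be reached, and the exception
        // message usually says which address the connection attempt failed on.
        try (JMXConnector connector = JMXConnectorFactory.connect(serviceUrl(host, port))) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            // null, null means "all MBeans, no filter"
            Set<ObjectName> names = mbeans.queryNames(null, null);
            for (ObjectName name : names) {
                System.out.println(name);
            }
        }
    }
}
```

Running it from the same machine as jconsole, e.g. java JmxProbeSketch <host>
<port>, would at least show the underlying exception instead of a bare refusal
to connect.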

