incubator-cassandra-user mailing list archives

From Richard grossman <richie...@gmail.com>
Subject Re: Time to insert bulk data is very high comparing to database
Date Sun, 08 Nov 2009 17:47:45 GMT
Hi

On Sun, Nov 8, 2009 at 3:56 PM, Jonathan Ellis <jbellis@gmail.com> wrote:

> - You’ll easily double performance by setting the log level from DEBUG
> to INFO (unclear if you actually did this, so mentioning it for
> completeness)
>
No problem, I've checked and everything is on INFO.


> - 0.4.1 has bad default GC options. the defaults will be fixed for
> 0.4.2 and 0.5, but it’s easy to tweak for 0.4.1:
>
> http://mail-archives.apache.org/mod_mbox/incubator-cassandra-user/200910.mbox
>
Sorry, I can't find the post talking about that; I can't open this link on Mac
OS.
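
Just guessing at what that thread says, since I can't open it: do you mean the
GC options passed to the JVM in bin/cassandra.in.sh? Something like the flags
below is what I would try first (these are just the standard ParNew/CMS options
I know of, not necessarily the exact values from that post):

    -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC
    -XX:+CMSParallelRemarkEnabled
    -XX:SurvivorRatio=8
    -XX:MaxTenuringThreshold=1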



> - it doesn't look like you're doing parallel inserts.  you should have
> at least a few dozen to a few hundred threads if you want to measure
> throughput rather than just latency.  run the client on a machine that
> is not running cassandra, since it can also use a decent amount of
> CPU.
>
By parallel, do you mean writing code that runs the inserts in multiple threads
instead of one by one? If so, is the Thrift API thread safe? How do you manage
opening and closing the connections? For example, does each thread open one
connection and close it at the end?
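
Here is roughly what I have in mind, with one connection per thread (HOST,
PORT, THREADS and doInserts() are placeholders for my own code, I'm assuming a
Thrift client can't be shared between threads, and the class names are from the
0.4 bindings as I have them):

    import org.apache.cassandra.service.Cassandra;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class ParallelInserter {
        private static final String HOST = "10.1.2.3"; // one of the cassandra nodes
        private static final int PORT = 9160;
        private static final int THREADS = 50;         // "a few dozen" to start with

        public static void main(String[] args) throws Exception {
            Thread[] workers = new Thread[THREADS];
            for (int i = 0; i < THREADS; i++) {
                workers[i] = new Thread(new Runnable() {
                    public void run() {
                        // each thread opens and closes its own connection
                        TTransport transport = new TSocket(HOST, PORT);
                        try {
                            transport.open();
                            Cassandra.Client client =
                                    new Cassandra.Client(new TBinaryProtocol(transport));
                            doInserts(client); // placeholder: my batch_insert loop
                        } catch (Exception e) {
                            e.printStackTrace();
                        } finally {
                            transport.close();
                        }
                    }
                });
                workers[i].start();
            }
            for (Thread t : workers) {
                t.join();
            }
        }

        static void doInserts(Cassandra.Client client) throws Exception {
            // ... the insertChannelShow code below, called in a loop ...
        }
    }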


>  - using batch_insert will be much faster than multiple single-column
> inserts to the same row
>
I've made a modification like this:
    public void insertChannelShow(String showId, String channelId, String airDate,
            String duration, String title, String parentShowId, String genre,
            String price, String subtitle) throws Exception {
        Calendar calendar = Calendar.getInstance();
        dateFormat.setCalendar(calendar);
        Date air = dateFormat.parse(airDate);
        calendar.setTime(air);

        String key = String.valueOf(calendar.getTimeInMillis()) + ":" + showId + ":" + channelId;

        long timestamp = System.currentTimeMillis();

        Map<String, List<ColumnOrSuperColumn>> insertDataMap =
                new HashMap<String, List<ColumnOrSuperColumn>>();
        List<ColumnOrSuperColumn> rowData = new ArrayList<ColumnOrSuperColumn>();

        rowData.add(new ColumnOrSuperColumn(
                new Column("duration".getBytes("UTF-8"), duration.getBytes("UTF-8"), timestamp), null));
        rowData.add(new ColumnOrSuperColumn(
                new Column("title".getBytes("UTF-8"), title.getBytes("UTF-8"), timestamp), null));
        rowData.add(new ColumnOrSuperColumn(
                new Column("parentShowId".getBytes("UTF-8"), parentShowId.getBytes("UTF-8"), timestamp), null));
        rowData.add(new ColumnOrSuperColumn(
                new Column("genre".getBytes("UTF-8"), genre.getBytes("UTF-8"), timestamp), null));
        rowData.add(new ColumnOrSuperColumn(
                new Column("price".getBytes("UTF-8"), price.getBytes("UTF-8"), timestamp), null));
        rowData.add(new ColumnOrSuperColumn(
                new Column("subtitle".getBytes("UTF-8"), subtitle.getBytes("UTF-8"), timestamp), null));

        insertDataMap.put("channelShow", rowData);

        cassandraClient.batch_insert("Keyspace1", key, insertDataMap, ConsistencyLevel.ONE);

        insertDataMap.clear();
        rowData.clear();
        insertDataMap = null;
        rowData = null;
    }


Is this what you had in mind?

Anyway, I've opened a new small instance on Amazon to run the inserts, separate
from the ones running Cassandra, and pointed it at one of the Cassandra server
IPs. It didn't improve anything. The client machine is at 1% CPU and the server
machines are at 1% CPU.

The problem comes when the data is distributed between the 2 Cassandra servers:
as long as the data goes to the commitlog of the first server everything is OK,
~2000 rows/second. But when the data goes to the second server it falls very
sharply, to ~200 rows/second.

I've read that I can check latency with JMX. That's fine, but I can't manage to
connect to the JMX agent on Amazon; the params look OK but nothing helps, and
the jconsole on my side refuses to connect. Is there something else I can check?
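
For what it's worth, these are the kind of JMX options I mean on the Cassandra
side (the port and hostname here are just examples, not necessarily exactly
what's in my setup):

    -Dcom.sun.management.jmxremote.port=8081
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false
    -Djava.rmi.server.hostname=<ec2 public dns name>

and then jconsole pointed at <ec2 public dns name>:8081, with that port open in
the security group.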

Thanks
