incubator-cassandra-user mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: Cassandra versus HBase performance study
Date Thu, 04 Feb 2010 01:05:11 GMT
We have really obvious optimizations to make there that haven't been
done because the biggest contributors so far are using
RandomPartitioner...

Are you using get_key_range or get_range_slice for scanning?  The
former is even slower and deprecated.

With get_range_slice your comparator matters; BytesType is fastest.
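
To illustrate why the comparator can matter for a scan, here is a minimal
sketch (not Cassandra's actual code) contrasting a raw byte comparison,
roughly what a BytesType-style comparator needs to do, with one that first
decodes both names as UTF-8. The class and method names are hypothetical, and
the assumption that a string-typed comparator decodes on every comparison is
made only for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names come from Cassandra.
class ComparatorSketch {

    // Unsigned lexicographic comparison of raw bytes: no allocation, no decoding.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff; // treat each byte as unsigned
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        // All shared positions are equal, so the shorter array sorts first.
        return Integer.compare(a.length, b.length);
    }

    // Assumed behaviour of a string-typed comparator: decode both sides, then
    // compare. Each call allocates two Strings before any comparison happens.
    static int compareUtf8(byte[] a, byte[] b) {
        String s = new String(a, StandardCharsets.UTF_8);
        String t = new String(b, StandardCharsets.UTF_8);
        return s.compareTo(t);
    }

    public static void main(String[] args) {
        byte[] left = "field0".getBytes(StandardCharsets.UTF_8);
        byte[] right = "field1".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes: " + compareBytes(left, right));
        System.out.println("utf8:  " + compareUtf8(left, right));
    }
}
```

Both methods agree on ordering for plain ASCII names like the ones above; the
difference is only in how much work each comparison does.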

-Jonathan

On Wed, Feb 3, 2010 at 6:45 PM, Brian Frank Cooper
<cooperb@yahoo-inc.com> wrote:
> 0.5 does seem to be significantly faster - the latency is better and it provides significantly
> more throughput. I'm updating my charts with new values now.
>
> One thing that is puzzling is the scan performance. The scan experiment is to scan between
> 1 and 100 records on each request. My 6-node Cassandra cluster is only getting up to about 230
> operations/sec, compared to >1400 ops/sec for other systems. The latency is also quite a bit
> higher. A chart with these results is here:
>
> http://www.brianfrankcooper.net/pubs/scans.png
>
> Is this the expected performance? I'm using the OrderPreservingPartitioner with InitialToken
> values that should evenly partition the data (and the amount of data in /var/cassandra/data
> is about the same on all servers). I'm using get_range_slice() from Java (code snippet below).
>
> At the max throughput (230 ops/sec), when latency is over 1.2 sec, CPU usage varies from
> ~5% to ~72% on different boxes. Disk busy varies from 60% to 90% (and the machine with the
> busiest disk is not the one with the highest CPU usage). Network utilization (eth0 %util both
> in and out) varies from 15% to 40% on different boxes. So clearly there is some imbalance (and
> the workload itself is skewed via a Zipfian distribution), but I'm surprised that the latencies
> are so high even in this case.
>
> Code snippet - fields is a Set<String> listing the columns I want; recordcount
> is the number of records to return.
>
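> SlicePredicate predicate;
> if (fields==null)
> {
>        // No field list: slice all columns of each row (up to 1000000 per row).
>        predicate = new SlicePredicate(null,new SliceRange(new byte[0], new byte[0],false,1000000));
> }
> else
> {
>        // Otherwise request only the named columns, encoded as UTF-8 bytes.
>        Vector<byte[]> fieldlist=new Vector<byte[]>();
>        for (String s : fields)
>        {
>                fieldlist.add(s.getBytes("UTF-8"));
>        }
>        predicate = new SlicePredicate(fieldlist,null);
> }
> // "data" is the column family; null means no super column.
> ColumnParent parent = new ColumnParent("data", null);
>
> // Scan up to recordcount rows starting at startkey; the empty finish key leaves the range open-ended.
> List<KeySlice> results = client.get_range_slice(table,parent,predicate,startkey,"",recordcount,ConsistencyLevel.ONE);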
>
> Thanks!
>
> Brian
>
> ________________________________________
> From: Brian Frank Cooper
> Sent: Saturday, January 30, 2010 7:56 AM
> To: cassandra-user@incubator.apache.org
> Subject: RE: Cassandra versus HBase performance study
>
> Good idea, we'll benchmark 0.5 next.
>
> brian
>
> -----Original Message-----
> From: Jonathan Ellis [mailto:jbellis@gmail.com]
> Sent: Friday, January 29, 2010 1:13 PM
> To: cassandra-user@incubator.apache.org
> Subject: Re: Cassandra versus HBase performance study
>
> Thanks for posting your results; it is an interesting read and we are
> pleased to beat HBase in most workloads. :)
>
> Since you originally benchmarked 0.4.2, you might be interested in the
> speed gains in 0.5.  A couple graphs here:
> http://spyced.blogspot.com/2010/01/cassandra-05.html
>
> 0.6 (beta in a few weeks?) is looking even better. :)
>
> -Jonathan
>
