incubator-cassandra-user mailing list archives

From James Masson <james.mas...@opigram.com>
Subject Re: Cassandra read throughput with little/no caching.
Date Fri, 21 Dec 2012 15:36:31 GMT

Hi,

Thanks for the reply.

On 21/12/12 14:36, Yiming Sun wrote:
> I have a few questions for you, James,
>
> 1. how many nodes are in your Cassandra ring?

2 or 3, depending on the environment. It doesn't seem to make much
difference to throughput: a 30 minute task on a 2 node environment is
still a 30 minute task on a 3 node environment.

> 2. what is the replication factor?

1

> 3. when you say sequentially, what do you mean?  what Partitioner do you
> use?

The data is organised by date. The keys are read sequentially, in order,
and only once.

Random partitioner, so the data is spread evenly across the nodes to
avoid hotspots.
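
As a rough illustration of why date-ordered keys do not stay together
under the random partitioner: it places each row by the MD5 hash of its
key, so keys that are adjacent in date order land on effectively random
positions in the ring. The sketch below is only an approximation of that
behaviour. The 3-node ring with evenly spaced tokens and the ISO-date
row keys are assumptions made for the example, not details taken from
this cluster.

```python
import hashlib
from bisect import bisect_left
from datetime import date, timedelta

RING_SIZE = 2 ** 127  # RandomPartitioner tokens fall in the range 0..2**127


def random_partitioner_token(key: bytes) -> int:
    """MD5 the key and take the absolute value of the signed 128-bit result."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, byteorder="big", signed=True))


def owner(token: int, node_tokens: list) -> int:
    """Index of the first node whose token is >= the key token, wrapping round."""
    return bisect_left(node_tokens, token) % len(node_tokens)


# Hypothetical 3-node ring with evenly spaced tokens (an assumption).
NODE_TOKENS = [i * RING_SIZE // 3 for i in range(3)]

# Hypothetical date-ordered row keys, visited in order.
keys = [(date(2012, 1, 1) + timedelta(days=d)).isoformat() for d in range(365)]
owners = [owner(random_partitioner_token(k.encode("ascii")), NODE_TOKENS)
          for k in keys]

# Number of keys that landed on each node.
counts = [owners.count(n) for n in range(len(NODE_TOKENS))]
print(counts)
```

Reading the keys in date order therefore hops between nodes from one
request to the next, rather than scanning one node's data contiguously.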

> 4. how many columns per row?  how much data per row?  per column?

It varies; see the schema below.

create keyspace mykeyspace
   with placement_strategy = 'SimpleStrategy'
   and strategy_options = {replication_factor : 1}
   and durable_writes = true;


create column family entities
   with column_type = 'Standard'
   and comparator = 'BytesType'
   and default_validation_class = 'BytesType'
   and key_validation_class = 'AsciiType'
   and read_repair_chance = 0.0
   and dclocal_read_repair_chance = 0.0
   and gc_grace = 0
   and min_compaction_threshold = 4
   and max_compaction_threshold = 32
   and replicate_on_write = false
   and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
   and caching = 'NONE'
   and column_metadata = [
     {column_name : '64656c65746564',
     validation_class : BytesType,
     index_name : 'deleted_idx',
     index_type : 0},
     {column_name : '6576656e744964',
     validation_class : TimeUUIDType,
     index_name : 'eventId_idx',
     index_type : 0},
     {column_name : '7061796c6f6164',
     validation_class : UTF8Type}];

2 columns per row here, about 200 MB of data in total.


create column family events
   with column_type = 'Standard'
   and comparator = 'BytesType'
   and default_validation_class = 'BytesType'
   and key_validation_class = 'TimeUUIDType'
   and read_repair_chance = 0.0
   and dclocal_read_repair_chance = 0.0
   and gc_grace = 0
   and min_compaction_threshold = 4
   and max_compaction_threshold = 32
   and replicate_on_write = false
   and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
   and caching = 'NONE';

1 column per row, about 300 MB of data.

create column family intervals
   with column_type = 'Standard'
   and comparator = 'BytesType'
   and default_validation_class = 'BytesType'
   and key_validation_class = 'AsciiType'
   and read_repair_chance = 0.0
   and dclocal_read_repair_chance = 0.0
   and gc_grace = 0
   and min_compaction_threshold = 4
   and max_compaction_threshold = 32
   and replicate_on_write = false
   and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
   and caching = 'NONE';

Variable columns per row, about 40 MB of data.


> 5. what client library do you use to access Cassandra?  (Hector?).  Is
> your client code single threaded?

Hector. Yes, the processing side of the client is single threaded, but
it spends most of its time waiting for Cassandra responses and has
plenty of CPU headroom.
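
A back-of-the-envelope check of how much of a single blocking thread's
time the reported latencies would account for, using the request rates
and latencies from the original post quoted further down. It assumes
requests are issued one at a time and that the quoted latencies are
close to what the client actually waits per request; if they are
server-side figures, the client-side wait will be somewhat higher.

```python
# Figures from the original post quoted later in this message.
writes_per_sec = 250
reads_per_sec = 100
write_latency_s = 900e-6    # about 900 microseconds per write
read_latency_s = 4000e-6    # about 4000 microseconds per read

# Fraction of each second one blocking thread spends waiting on Cassandra.
wait_on_writes = writes_per_sec * write_latency_s
wait_on_reads = reads_per_sec * read_latency_s
wait_total = wait_on_writes + wait_on_reads

# Ceiling on read rate if the same thread did nothing but serial reads.
max_serial_reads_per_sec = 1 / read_latency_s

print(f"waiting on writes: {wait_on_writes:.3f} s per second")
print(f"waiting on reads:  {wait_on_reads:.3f} s per second")
print(f"waiting in total:  {wait_total:.3f} s per second")
print(f"serial read ceiling: {max_serial_reads_per_sec:.0f} reads/sec")
```

On these assumptions well over half of the client's wall-clock time is
spent waiting on requests. Because a serial client is bound by
per-request latency rather than by cluster capacity, this may also be
why adding a third node does not shorten the run.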


I guess what I'm most interested in is why there is such a discrepancy
between read and write latency, although I understand that the data
volume is much larger for reads, even though the request rate is lower.

Network usage on a Cassandra box barely gets above 20 Mbit/s, including
inter-cluster comms. It averages 5 Mbit/s between client and Cassandra.
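
To put those network figures in the same units as the data sizes, the
conversion can be sketched as follows. It reads "Mbit" as megabits per
second on a single box and assumes the 30 minute task mentioned earlier
is the same workload; both are interpretations, not stated facts.

```python
avg_mbit_per_s = 5       # average client<>Cassandra traffic quoted above
peak_mbit_per_s = 20     # peak per-box traffic quoted above
run_seconds = 30 * 60    # the 30 minute task mentioned earlier (assumed same run)


def mbit_to_mbyte(mbit: float) -> float:
    """Convert megabits to megabytes (8 bits per byte)."""
    return mbit / 8


avg_mb_per_s = mbit_to_mbyte(avg_mbit_per_s)
peak_mb_per_s = mbit_to_mbyte(peak_mbit_per_s)
run_total_mb = avg_mb_per_s * run_seconds

print(f"average: {avg_mb_per_s} MB/s, peak: {peak_mb_per_s} MB/s")
print(f"client traffic over the run: {run_total_mb:.0f} MB per box")
```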

There is near-zero disk I/O, and what little there is is served in under
1 ms. Storage is backed by a very fast SAN but, as I said earlier, the
dataset just about fits in the Linux disk cache. The VM has 2 GB of RAM
with a 512 MB Cassandra heap. GCs are nice and quick, there are no JVM
memory problems, and used heap oscillates between 280 and 350 MB.
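
The "just about fits" estimate can be checked with simple arithmetic
from the sizes given with the schema above. This is only a sketch: it
takes 2 GB as 2048 MB, assumes the random partitioner splits the data
evenly across nodes at replication factor 1, and treats all RAM outside
the heap as an upper bound on the page cache, although the OS and the
JVM's off-heap memory also draw on it.

```python
# Sizes as reported in this message, in MB.
column_family_mb = {"entities": 200, "events": 300, "intervals": 40}
vm_ram_mb = 2 * 1024     # 2 GB VM, taken as 2048 MB (assumption)
heap_mb = 512            # Cassandra heap

dataset_mb = sum(column_family_mb.values())
outside_heap_mb = vm_ram_mb - heap_mb   # upper bound on page cache per node

for nodes in (2, 3):
    # Replication factor 1 with an even split: each node holds 1/nodes.
    per_node_mb = dataset_mb / nodes
    print(f"{nodes} nodes: ~{per_node_mb:.0f} MB of data per node, "
          f"at most {outside_heap_mb} MB of RAM outside the heap")
```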

Basically, I'm just puzzled, as Cassandra doesn't behave as I would
expect: huge CPU use in Cassandra for very little throughput. I'm
struggling to find anything wrong with the environment; there's no
bottleneck that I can see.

thanks

James M



>
>
> On Fri, Dec 21, 2012 at 7:27 AM, James Masson
> <james.masson@opigram.com> wrote:
>
>
>     Hi list-users,
>
>     We have an application that has a relatively unusual access pattern
>     in cassandra 1.1.6
>
>     Essentially we read an entire multi hundred megabyte column family
>     sequentially (little chance of a cassandra cache hit), perform some
>     operations on the data, and write the data back to another column
>     family in the same keyspace.
>
>     We do about 250 writes/sec and 100 reads/sec during this process.
>     Write request latency is about 900 microsecs, read request latency
>     is about 4000 microsecs.
>
>     * First Question: Do these numbers make sense?
>
>     read-request latency seems a little high to me, cassandra hasn't had
>     a chance to cache this data, but it's likely in the Linux disk
>     cache, given the sizing of the node/data/jvm.
>
>     thanks
>
>     James M
>
>
