incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Cassandra read throughput with little/no caching.
Date Sun, 23 Dec 2012 20:18:19 GMT
First, the unhelpful advice: I strongly suggest changing the data model so you do not have
100MB+ rows. They will make life harder.

> Write request latency is about 900 microsecs, read request latency is about 4000 microsecs.

4 milliseconds to drag 100 to 300 MB of data off a SAN, through your network, into C* and out
to the client does not sound terrible at first glance. Can you benchmark an individual request
to get an idea of the throughput?
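If it helps, here is a minimal harness for that kind of single-request benchmark, sketched in Python; the lambda at the bottom is just a stand-in for whatever wrapper you have around your real client read, so plug that in instead:

```python
import time

def measure_throughput(fetch, n_trials=5):
    """Time a single-row fetch and report MB/s for the fastest trial.

    `fetch` is any zero-argument callable returning the row payload as
    bytes -- e.g. a thin wrapper around a Hector/Thrift read call.
    """
    best = float("inf")
    size = 0
    for _ in range(n_trials):
        start = time.perf_counter()
        payload = fetch()
        # Guard against a sub-resolution timing on tiny payloads.
        elapsed = max(time.perf_counter() - start, 1e-9)
        size = len(payload)
        best = min(best, elapsed)
    return (size / (1024 * 1024)) / best

# Stand-in for a real client call: a pretend fetch of a 10 MB row.
throughput = measure_throughput(lambda: b"x" * (10 * 1024 * 1024))
```

Taking the best of a few trials filters out warm-up noise; a real run would also want to separate cold-cache and warm-cache numbers.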

I would recommend removing the SAN from the equation; Cassandra will run better with local
disks. A SAN also introduces a single point of failure into a distributed system.

> but it's likely in the Linux disk cache, given the sizing of the node/data/jvm.

Are you sure that the local Linux machine is going to cache files stored on the SAN?

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 22/12/2012, at 6:56 AM, Yiming Sun <yiming.sun@gmail.com> wrote:

> James, you could experiment with the row cache, with the off-heap (JNA) cache, and see if
> it helps. My own experience with the row cache was not good, and the OS cache seemed to be
> most useful, but in my case our data space was big, over 10TB. Your sequential access
> pattern certainly doesn't play well with LRU, but given the small data space you have, you
> may be able to fit the data from one column family entirely into the row cache.
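A sketch of what that experiment looks like on 1.1 - the caching flag is set per column family, the cache sizes are global in cassandra.yaml, and the off-heap provider needs JNA on the classpath. The 512 below is only a guess to be tuned against your heap/off-heap budget:

```
# cassandra-cli, per column family:
update column family entities with caching = 'rows_only';

# cassandra.yaml, global cache sizing:
row_cache_size_in_mb: 512
row_cache_provider: SerializingCacheProvider   # off-heap, requires JNA
```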
> 
> 
> On Fri, Dec 21, 2012 at 12:03 PM, James Masson <james.masson@opigram.com> wrote:
> 
> 
> On 21/12/12 16:27, Yiming Sun wrote:
> James, using RandomPartitioner, the order of the rows is random, so when
> you request these rows in "Sequential" order (sort by the date?),
> Cassandra is not reading them sequentially.
> 
> Yes, I understand the "next" row to be retrieved in sequence is likely to be on a different
> node, and that the ordering is random. I'm using the word "sequential" to mean that the
> data is requested in a fixed order, and not repeated until the next cycle. The data is not
> guaranteed to be of a size that is cacheable as a whole.
> 
> 
> 
> The size of the data, 200Mb, 300Mb , and 40Mb, are these the size for
> each column? Or are these the total size of the entire column family?
>   It wasn't too clear to me.  But if these are the total size of the
> column families, you will be able to fit them mostly in memory, so you
> should enable row cache.
> 
> Size of the column family, on a single node. Row caching is off at the moment.
> 
> Are you saying that I should increase the JVM heap to fit some data in the row cache, at
> the expense of Linux disk caching?
> 
> Bear in mind that the data is only going to be re-requested in sequence again - I'm not
> sure what the value is in Cassandra's native caching if rows are not re-requested before
> being evicted.
> 
> My current key-cache hit-rates are near zero on this workload, hence I'm interested in
> Cassandra's zero-cache performance. Unless I can guarantee to fit the entire data-set in
> memory, it's difficult to justify spending memory on a Cassandra cache if LRU and the
> workload mean it's not actually a benefit.
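That intuition is easy to demonstrate: a one-pass sequential scan over a key space larger than an LRU cache evicts every row before it is requested again, so the hit rate is exactly zero. A toy simulation (plain Python, nothing Cassandra-specific):

```python
from collections import OrderedDict

class LRUCache:
    """Counting LRU cache: records hits/misses, inserts on miss."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.data:
            self.hits += 1
            self.data.move_to_end(key)   # mark as most recently used
            return True
        self.misses += 1
        self.data[key] = None
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
        return False

# Three full sequential passes over 10,000 keys through a 1,000-row
# cache: each key is always evicted before its next request arrives.
cache = LRUCache(1000)
for _ in range(3):
    for key in range(10000):
        cache.get(key)

hit_rate = cache.hits / (cache.hits + cache.misses)
# hit_rate is 0.0 whenever the pass is larger than the cache capacity
```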
> 
> 
> 
> I happen to have done some performance tests of my own on Cassandra, mostly on reads, and
> was only able to get less than 6MB/sec read rate out of a cluster of 6 nodes at RF=2 using
> a single-threaded client. But it made a huge difference when I changed the client to an
> asynchronous multi-threaded structure.
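The effect is straightforward to show with a stub: if each read is mostly network wait, N requests in flight divide the wall-clock time by roughly N. A sketch with Python threads, where a sleep stands in for the blocking client call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_row(key):
    """Stand-in for a blocking Cassandra read (~10 ms of network wait)."""
    time.sleep(0.01)
    return key

keys = list(range(50))

# Serial: total time ~= n * latency.
start = time.perf_counter()
serial = [fetch_row(k) for k in keys]
serial_time = time.perf_counter() - start

# 10 requests in flight: total time ~= n * latency / 10.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    parallel = list(pool.map(fetch_row, keys))
parallel_time = time.perf_counter() - start

# Same results either way; only the wall-clock time differs.
assert serial == parallel
```

The same reasoning applies to a Hector client: the worker count only needs to be large enough to keep the cluster busy, not to add CPU parallelism.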
> 
> 
> Yes, I've been talking to the developers about having a separate thread or two that keeps
> Cassandra busy, keeping Disruptor (http://lmax-exchange.github.com/disruptor/) fed to do
> the processing work.
> 
> But none of this changes the fact that under this zero-cache workload, Cassandra seems to
> be very CPU-expensive for the throughput it delivers.
> 
> thanks
> 
> James M
> 
> 
> 
> 
> On Fri, Dec 21, 2012 at 10:36 AM, James Masson <james.masson@opigram.com> wrote:
> 
> 
>     Hi,
> 
>     thanks for the reply
> 
> 
>     On 21/12/12 14:36, Yiming Sun wrote:
> 
>         I have a few questions for you, James,
> 
>         1. how many nodes are in your Cassandra ring?
> 
> 
>     2 or 3 - depending on environment - it doesn't seem to make much
>     difference to throughput. A 30-minute task on a 2-node environment
>     is still a 30-minute task on a 3-node environment.
> 
> 
>         2. what is the replication factor?
> 
> 
>     1
> 
>         3. when you say sequentially, what do you mean?  what
>         Partitioner do you
>         use?
> 
> 
>     The data is organised by date - the keys are read sequentially in
>     order, only once.
> 
>     Random partitioner - the data is equally spread across the nodes to
>     avoid hotspots.
> 
> 
>         4. how many columns per row?  how much data per row?  per column?
> 
> 
>     varies - described in the schema.
> 
>     create keyspace mykeyspace
>        with placement_strategy = 'SimpleStrategy'
>        and strategy_options = {replication_factor : 1}
>        and durable_writes = true;
> 
> 
>     create column family entities
>        with column_type = 'Standard'
>        and comparator = 'BytesType'
>        and default_validation_class = 'BytesType'
>        and key_validation_class = 'AsciiType'
>        and read_repair_chance = 0.0
>        and dclocal_read_repair_chance = 0.0
>        and gc_grace = 0
>        and min_compaction_threshold = 4
>        and max_compaction_threshold = 32
>        and replicate_on_write = false
>        and compaction_strategy =
>     'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
> 
>        and caching = 'NONE'
>        and column_metadata = [
>          {column_name : '64656c65746564',
>          validation_class : BytesType,
>          index_name : 'deleted_idx',
>          index_type : 0},
>          {column_name : '6576656e744964',
>          validation_class : TimeUUIDType,
>          index_name : 'eventId_idx',
>          index_type : 0},
>          {column_name : '7061796c6f6164',
>          validation_class : UTF8Type}];
> 
>     2 columns per row here - about 200Mb of data in total
> 
> 
>     create column family events
>        with column_type = 'Standard'
>        and comparator = 'BytesType'
>        and default_validation_class = 'BytesType'
>        and key_validation_class = 'TimeUUIDType'
>        and read_repair_chance = 0.0
>        and dclocal_read_repair_chance = 0.0
>        and gc_grace = 0
>        and min_compaction_threshold = 4
>        and max_compaction_threshold = 32
>        and replicate_on_write = false
>        and compaction_strategy =
>     'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
> 
>        and caching = 'NONE';
> 
>     1 column per row - about 300Mb of data
> 
>     create column family intervals
>        with column_type = 'Standard'
>        and comparator = 'BytesType'
>        and default_validation_class = 'BytesType'
>        and key_validation_class = 'AsciiType'
>        and read_repair_chance = 0.0
>        and dclocal_read_repair_chance = 0.0
>        and gc_grace = 0
>        and min_compaction_threshold = 4
>        and max_compaction_threshold = 32
>        and replicate_on_write = false
>        and compaction_strategy =
>     'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
> 
>        and caching = 'NONE';
> 
>     variable columns per row - about 40Mb of data.
> 
> 
> 
>         5. what client library do you use to access Cassandra?
>           (Hector?).  Is
>         your client code single threaded?
> 
> 
>     Hector - yes, the processing side of the client is single threaded,
>     but is largely waiting for cassandra responses and has plenty of CPU
>     headroom.
> 
> 
>     I guess what I'm most interested in is why the discrepancy in
>     between read/write latency - although I understand the data volume
>     is much larger in reads, even though the request rate is lower.
> 
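One back-of-envelope check on those numbers: multiplying each request rate by its latency gives the fraction of wall-clock time a single synchronous client thread spends blocked, which already caps what a single-threaded client can push:

```python
# Fraction of each second a single-threaded synchronous client spends
# blocked waiting on Cassandra (utilisation = sum of rate * latency).
writes_per_sec = 250
write_latency_s = 900e-6      # 900 microseconds
reads_per_sec = 100
read_latency_s = 4000e-6      # 4 milliseconds

busy_fraction = (writes_per_sec * write_latency_s +
                 reads_per_sec * read_latency_s)
# 0.225 + 0.4 = 0.625: the client thread is already ~62% blocked on
# requests, so it cannot drive much more load without concurrency.
```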
>     Network usage on a cassandra box barely gets above 20Mbit, including
>     inter-cluster comms. Averages 5mbit client<>cassandra
> 
>     There is near zero disk I/O, and what little there is is served in
>     under 1ms. Storage is backed by a very fast SAN, but like I said
>     earlier, the dataset just about fits in the Linux disk cache. 2GB VM,
>     512MB Cassandra heap - GCs are nice and quick, no JVM memory
>     problems, used heap oscillates between 280-350MB.
> 
>     Basically, I'm just puzzled as cassandra doesn't behave as I would
>     expect. Huge CPU use in cassandra for very little throughput. I'm
>     struggling to find anything that's wrong with the environment,
>     there's no bottleneck that I can see.
> 
>     thanks
> 
>     James M
> 
> 
> 
> 
> 
>         On Fri, Dec 21, 2012 at 7:27 AM, James Masson
>         <james.masson@opigram.com> wrote:
> 
> 
>              Hi list-users,
> 
>              We have an application that has a relatively unusual access
>         pattern
>              in cassandra 1.1.6
> 
>              Essentially we read an entire multi hundred megabyte column
>         family
>              sequentially (little chance of a cassandra cache hit),
>         perform some
>              operations on the data, and write the data back to another
>         column
>              family in the same keyspace.
> 
>              We do about 250 writes/sec and 100 reads/sec during this
>         process.
>              Write request latency is about 900 microsecs, read request
>         latency
>              is about 4000 microsecs.
> 
>              * First Question: Do these numbers make sense?
> 
>              read-request latency seems a little high to me, cassandra
>         hasn't had
>              a chance to cache this data, but it's likely in the Linux disk
>              cache, given the sizing of the node/data/jvm.
> 
>              thanks
> 
>              James M
> 
> 
> 
> 

