cassandra-user mailing list archives

From Gianluca Borello <gianl...@sysdig.com>
Subject Re: Unexpected high internode network activity
Date Fri, 26 Feb 2016 04:12:11 GMT
Thank you for your reply.

To answer your points:

- I fully agree on the write volume; in fact, my isolated tests confirm
your estimate

- About the reads, I agree as well, but the volume of data is still much
higher than expected

- I am writing to one single keyspace with RF 3, there's just one keyspace

- I am not using any indexes, the column families are very simple

- I am aware of the double count; in fact, I measured the traffic on port
9042 at the client side (so just counted once) and I divided by two the
traffic on port 7000 as measured on each node (35 GB -> 17.5 GB). All the
measurements were done with iftop with proper BPF filters on the
port, and the total traffic matches what I see in CloudWatch (divided by two)

So unfortunately I still don't have any idea what's going on or why
I'm seeing 17 GB of internode traffic instead of ~5-6 GB.
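
For reference, the arithmetic behind these figures can be sketched in a few
lines of Python. The input numbers are the ones quoted in this thread; the
function names are hypothetical, and the quorum estimate is only one reading
of the description quoted below. It assumes the coordinator is never a
replica and that digests and acknowledgements are small enough to ignore.

```python
# Figures quoted in this thread, in GB, summed over all nodes.
client_writes_gb = 1.0           # port 9042, client -> cluster
client_reads_gb = 1.5            # port 9042, cluster -> client
port_7000_both_ends_gb = 35.0    # port 7000, counted at sender and receiver
replication_factor = 3


def internode_generated(measured_both_ends_gb):
    """Halve the per-node sum: each byte is seen once sent, once received."""
    return measured_both_ends_gb / 2


def worst_case_estimate(writes_gb, reads_gb, rf):
    """Upper bound from the original post: every byte crosses the wire rf times."""
    return (writes_gb + reads_gb) * rf


def quorum_estimate(writes_gb, reads_gb, rf):
    """Assumption: writes go to all rf replicas, a quorum read fetches one
    full copy, and digest/ack traffic is ignored as negligible."""
    return writes_gb * rf + reads_gb


observed = internode_generated(port_7000_both_ends_gb)
worst = worst_case_estimate(client_writes_gb, client_reads_gb, replication_factor)
quorum = quorum_estimate(client_writes_gb, client_reads_gb, replication_factor)
ratio = observed / (client_writes_gb + client_reads_gb)

print(f"observed internode traffic: {observed:.1f} GB")   # 17.5 GB
print(f"worst-case estimate:        {worst:.1f} GB")      # 7.5 GB
print(f"quorum estimate:            {quorum:.1f} GB")     # 4.5 GB
print(f"internode / client ratio:   {ratio:.1f}x")        # 7.0x
```

The quorum estimate comes out a little below the ~5-6 GB I mentioned because
it leaves out digests, acknowledgements and protocol overhead.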

On Thursday, February 25, 2016, daemeon reiydelle <daemeonr@gmail.com>
wrote:

> If read & write at quorum then you write 3 copies of the data then return
> to the caller; when reading you read one copy (assume it is not on the
> coordinator), and 1 digest (because read at quorum is 2, not 3).
>
> When you insert, how many keyspaces get written to? (Are you using e.g.
> inverted indices?) That is my guess, that your db has about 1.8 bytes
> written for every byte inserted.
>
> Every byte you write is also counted as a read (system A sends 1 GB to
> system B, so system B receives 1 GB). You would not be charged if intra AZ,
> but inter AZ and inter DC will get that double count.
>
> So, my guess is reverse indexes, and you forgot to include receive and
> transmit.
>
>
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
>
> On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello <gianluca@sysdig.com>
> wrote:
>
>> Hello,
>>
>> We have a Cassandra 2.1.9 cluster on EC2 for one of our live
>> applications. There's a total of 21 nodes across 3 AWS availability zones,
>> c3.2xlarge instances.
>>
>> The configuration is pretty standard, we use the default settings that
>> come with the datastax AMI and the driver in our application is configured
>> to use lz4 compression. The keyspace where all the activity happens has RF
>> 3 and we read and write at quorum to get strong consistency.
>>
>> While analyzing our monthly bill, we noticed that the amount of network
>> traffic related to Cassandra was significantly higher than expected. After
>> breaking it down by port, it seems like over any given time, the internode
>> network activity is 6-7 times higher than the traffic on port 9042, whereas
>> we would expect something around 2-3 times, given the replication factor
>> and the consistency level of our queries.
>>
>> For example, this is the network traffic broken down by port and
>> direction over a few minutes, measured as sum of each node:
>>
>> Port 9042 from client to cluster (write queries): 1 GB
>> Port 9042 from cluster to client (read queries): 1.5 GB
>> Port 7000: 35 GB, which must be divided by two because the traffic is
>> always directed to another instance of the cluster, so that makes it 17.5
>> GB generated traffic
>>
>> The traffic on port 9042 completely matches our expectations: we do about
>> 100k write operations writing 10 KB binary blobs for each query, and a bit
>> more reads on the same data.
>>
>> According to our calculations, in the worst case, when the coordinator of
>> the query is not a replica for the data, this should generate about (1 +
>> 1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more.
>>
>> Also, hinted handoffs are disabled and nodes are healthy over the period
>> of observation, and I get the same numbers across pretty much every time
>> window, even including an entire 24-hour period.
>>
>> I tried to replicate this problem in a test environment, so I connected a
>> client to a test cluster running in a bunch of Docker containers (same
>> parameters; essentially the only difference is the
>> GossipingPropertyFileSnitch instead of the EC2 one), and I always get what I
>> expect: the amount of traffic on port 7000 is between 2 and 3 times the
>> amount of traffic on port 9042, and the queries are pretty much the same
>> ones.
>>
>> Before doing more analysis, I was wondering if someone has an explanation
>> for this problem, since perhaps we are missing something obvious here?
>>
>> Thanks
>>
>>
>>
>
