Subject: Re: Unexpected high internode network activity
From: Gianluca Borello <gianluca@sysdig.com>
To: user@cassandra.apache.org
Date: Thu, 25 Feb 2016 20:12:11 -0800
Thank you for your reply. To answer your points:

- I fully agree on the write volume; in fact, my isolated tests confirm your estimation
- About the read, I agree as well, but the volume of data is still much higher
- I am writing to one single keyspace with RF 3; there's just one keyspace
- I am not using any indexes; the column families are very simple
- I am aware of the double count: I measured the traffic on port 9042 at the client side (so it is counted once), and I divided the traffic on port 7000, as measured on each node, by two (35 GB -> 17.5 GB). All the measurements have been done with iftop with proper bpf filters on the port, and the total traffic matches what I see in CloudWatch (divided by two)

So unfortunately I still don't have any idea about what's going on and why I'm seeing 17 GB of internode traffic instead of ~5-6.

On Thursday, February 25, 2016, daemeon reiydelle <daemeonr@gmail.com> wrote:

> If you read & write at quorum, then you write 3 copies of the data before returning to the caller; when reading, you read one copy (assuming it is not on the coordinator) and 1 digest (because read at quorum is 2, not 3).
>
> When you insert, how many keyspaces get written to? (Are you using e.g. inverted indices?) That is my guess: that your db has about 1.8 bytes written for every byte inserted.
>
> Every byte you write is counted also as a read (system A sends 1 GB to system B, so system B receives 1 GB). You would not be charged if intra-AZ, but inter-AZ and inter-DC will get that double count.
>
> So, my guess is reverse indexes, and you forgot to include receive and transmit.
>
> .......
>
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
>
> On Thu, Feb 25, 2016 at 6:51 PM, Gianluca Borello <gianluca@sysdig.com> wrote:
>
>> Hello,
>>
>> We have a Cassandra 2.1.9 cluster on EC2 for one of our live applications.
>> There's a total of 21 nodes across 3 AWS availability zones, c3.2xlarge instances.
>>
>> The configuration is pretty standard: we use the default settings that come with the DataStax AMI, and the driver in our application is configured to use lz4 compression. The keyspace where all the activity happens has RF 3, and we read and write at quorum to get strong consistency.
>>
>> While analyzing our monthly bill, we noticed that the amount of network traffic related to Cassandra was significantly higher than expected. After breaking it down by port, it seems like, over any given time, the internode network activity is 6-7 times higher than the traffic on port 9042, whereas we would expect something around 2-3 times, given the replication factor and the consistency level of our queries.
>>
>> For example, this is the network traffic broken down by port and direction over a few minutes, measured as the sum over each node:
>>
>> Port 9042 from client to cluster (write queries): 1 GB
>> Port 9042 from cluster to client (read queries): 1.5 GB
>> Port 7000: 35 GB, which must be divided by two because the traffic is always directed to another instance of the cluster, so that makes it 17.5 GB of generated traffic
>>
>> The traffic on port 9042 completely matches our expectations: we do about 100k write operations writing 10 KB binary blobs for each query, and a bit more reads on the same data.
>>
>> According to our calculations, in the worst case, when the coordinator of the query is not a replica for the data, this should generate about (1 + 1.5) * 3 = 7.5 GB, and instead we see 17 GB, which is quite a lot more.
>>
>> Also, hinted handoffs are disabled and nodes are healthy over the period of observation, and I get the same numbers across pretty much every time window, even including an entire 24-hour period.
>>
>> I tried to replicate this problem in a test environment, so I connected a client to a test cluster running in a bunch of Docker containers (same parameters; essentially the only difference is the GossipingPropertyFileSnitch instead of the EC2 one), and I always get what I expect: the amount of traffic on port 7000 is between 2 and 3 times the amount of traffic on port 9042, and the queries are pretty much the same ones.
>>
>> Before doing more analysis, I was wondering if someone has an explanation for this problem, since perhaps we are missing something obvious here?
>>
>> Thanks
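To make the expectation in the thread concrete, here is a rough back-of-the-envelope model of the internode (port 7000) traffic under RF 3 and QUORUM reads/writes. This is a sketch only, not anything from the thread itself: the function name is made up, digest responses are treated as zero bytes, and compression, gossip, and read repair overhead are ignored. Note that the "(1 + 1.5) * 3 = 7.5 GB" figure quoted above multiplies reads by 3 as well, which is an upper bound; a QUORUM read only ships one full copy plus a small digest, so this model lands a bit lower, in the same ~2-5 GB ballpark as the "~5-6" expectation, and still far below the 17.5 GB observed.

```python
def expected_internode_gb(client_write_gb, client_read_gb, rf=3,
                          coordinator_is_replica=False):
    """Rough estimate of internode traffic for RF=3 with QUORUM
    reads and writes. Digest responses are modeled as ~0 bytes."""
    # Writes: the coordinator forwards each mutation to the replicas
    # (all rf of them when the coordinator holds no copy itself).
    replicas_to_forward = rf - 1 if coordinator_is_replica else rf
    write_traffic = client_write_gb * replicas_to_forward

    # Reads at QUORUM (2 of 3): one replica returns the full data and
    # another returns only a small digest (ignored here). If the
    # coordinator is itself a replica, the full copy is read locally
    # and essentially no read data crosses the internode link.
    read_traffic = 0.0 if coordinator_is_replica else client_read_gb

    return write_traffic + read_traffic

# Numbers from the thread: clients wrote ~1 GB and read ~1.5 GB.
worst = expected_internode_gb(1.0, 1.5, coordinator_is_replica=False)  # 4.5
best = expected_internode_gb(1.0, 1.5, coordinator_is_replica=True)    # 2.0
```

With 21 nodes, a token-aware driver would put the coordinator on a replica for most queries, so the real expectation sits between these two bounds, consistent with the "2-3 times the port 9042 traffic" rule of thumb in the original message.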