Subject: Re: Cassandra counters replication uses more traffic than client increments?
From: Sylvain Lebresne <sylvain@datastax.com>
To: user@cassandra.apache.org
Date: Wed, 9 Jan 2013 08:24:20 +0100

Since you're asking about counters, I'll note too that the internal
representation of counters is pretty fat. In your RF=2 case, each counter
is probably about 64 bytes internally, while on the client side you send
only an 8-byte value for each increment. So I don't think there is anything
unexpected in having more traffic server to server than client to client.

--
Sylvain
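As a rough back-of-envelope check of that ~64-byte figure, here is a minimal
sketch assuming the pre-1.2 counter context layout of one shard per replica,
each shard holding a 16-byte counter id, an 8-byte clock and an 8-byte count
(the per-shard sizes are an assumption for illustration, not something
stated in the thread):

    // Hypothetical size model of a counter's internal representation.
    public class CounterSizeEstimate {
        public static void main(String[] args) {
            int counterIdBytes = 16; // assumed: a TimeUUID naming the replica
            int clockBytes = 8;      // assumed: logical clock (long)
            int countBytes = 8;      // assumed: running total (long)
            int shardBytes = counterIdBytes + clockBytes + countBytes; // 32

            int replicasPerDc = 2;   // RF=2, as in the thread
            int contextBytes = replicasPerDc * shardBytes;             // 64

            int clientDeltaBytes = 8; // the single long the client sends

            System.out.printf("internal: ~%d bytes, client delta: %d bytes (%.0fx)%n",
                    contextBytes, clientDeltaBytes,
                    (double) contextBytes / clientDeltaBytes);
        }
    }

Under those assumptions, an increment that costs the client 8 bytes occupies
roughly 64 bytes once each replica's shard is carried along, so a fatter
replication stream is expected.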
On Wed, Jan 9, 2013 at 3:11 AM, aaron morton <aaron@thelastpickle.com> wrote:

> Can you measure the incoming client traffic on the nodes in DC 1 on port
> 9160? That would be more of an Apples to Apples comparison.
>
>> I've taken a look at some of the captured packets and it looks like
>> there's much more service information in DC-to-DC traffic compared to
>> client-to-server traffic -- although I am by no means certain here.
>
> In addition to writes, the potential sources of cross-DC traffic are
> Gossip and Repair. Gossip is pretty lightweight (for a 4 node cluster) and
> repair only happens if you ask it to. There could also be hints delivered
> from DC 1 to DC 2; these would show up in the logs on DC 1.
>
> Off the top of my head, the internal RowMutation serialisation is not too
> different to the Thrift mutation messages.
>
> There is also a message header; it includes: source IP, an int for the
> verb, some overhead for the key/values, the string FORWARD and the
> forwarding IP address.
>
> Compare this to a mutation message: keyspace name, row key, column family
> ID (int), column name, value + list/hash overhead.
>
> So for small single column updates the ratio of overhead to payload is
> kind of high.
>
>> - Is it indeed the case that server-to-server replication traffic can be
>> significantly more bloated than client-to-server traffic? Or do I need to
>> review my testing methodology?
>
> The meta data on the inter-node messages is pretty static; the bigger the
> payloads, the lower the ratio of overhead to payload. This is the same as
> messages that go between nodes within the same DC.
>
>> - Is there anything that can be done to reduce cross-DC replication
>> traffic? Perhaps some compression scheme?
>
> Fixed in 1.2:
> https://issues.apache.org/jira/browse/CASSANDRA-3127?attachmentOrder=desc
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
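To make Aaron's overhead-to-payload point concrete, here is a minimal sketch
with illustrative byte counts; only the header fields come from his list
above, and every individual size is a guess rather than the actual
serialization format:

    // Hypothetical model: fixed inter-node header vs. mutation payload.
    public class OverheadRatio {
        public static void main(String[] args) {
            // source IP, verb int, key/value overhead, "FORWARD", forwarding IP
            int headerBytes = 4 + 4 + 16 + 7 + 4;

            // keyspace name, row key, CF id, column name, value, list/hash overhead
            int mutationBytes = 10 + 8 + 4 + 10 + 8 + 16;

            for (int updates : new int[] {1, 10, 100}) {
                double ratio = (double) headerBytes / (updates * mutationBytes);
                System.out.printf("%3d updates/message -> overhead/payload ~ %.3f%n",
                        updates, ratio);
            }
        }
    }

The header is paid once per message, so the ratio that is "kind of high" for
a single small column update shrinks quickly as more mutations share one
message, matching Aaron's note that bigger payloads lower the overhead ratio.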
> On 8/01/2013, at 11:36 PM, Sergey Olefir <solf.lists@gmail.com> wrote:
>
> So with the holidays hopefully being over, I thought I'd ask again :)
>
> Could someone please help with answers to the two questions:
> - Is it reasonable to expect that cross-datacenter node-to-node
> replication traffic is greater than the actual client-to-server traffic
> that generates this activity? Specifically talking about counter
> increments.
> - Is there anything that can be done to lower the amount of
> cross-datacenter replication traffic while keeping actual replication
> going (i.e. we can't afford to not replicate data, but we can afford e.g.
> delays in replication)?
>
> Best regards,
> Sergey
>
> Sergey Olefir wrote:
>
> Hi,
>
> as part of our ongoing tests with Cassandra, we've tried to evaluate the
> amount of traffic generated in client-to-server and server-to-server
> (replication) scenarios.
>
> The results we are getting are surprising.
>
> Our setup:
> - Cassandra 1.1.7.
> - 3 DCs with 2 nodes each.
> - NetworkTopology replication strategy with 2 replicas per DC (so
> basically each node contains the full data set).
> - 100 clients concurrently incrementing counters at a rate of roughly
> 100/second each (i.e. about 10k increments per second). Clients perform
> writes to DC:1 only; server-to-server traffic measurement was done in
> DC:2.
> - Clients use batches to write to the server (up to 100 increments per
> batch; overall each client writes 1 or 2 batches per second).
>
> Clients are Java-based, accessing Cassandra via Hector, and run on a
> Windows box.
>
> Traffic measurement for clients (on Windows) was done via Resource
> Monitor and packet capture via Network Monitor. The overall traffic
> appears to be roughly 700KB/sec (kilobytes) for ~10000 increments per
> second.
>
> Traffic measurement for server-to-server was done on DC:2 via packet
> capture. This capture specifically included only nodes in other
> datacenters (so no intra-DC traffic was captured).
>
> The vast majority of traffic was directed to one node, DC:2-1; DC:2-2
> received about 1/30 of the traffic. I think I've read somewhere that
> Cassandra directs DC-to-DC traffic to one node, so this makes sense.
>
> What is surprising, though, is the amount of traffic. It looks to be
> roughly twice the total traffic generated by clients, i.e. something
> like 1.5MB/sec (megabytes). Note: this only counts incoming traffic.
>
> I've taken a look at some of the captured packets and it looks like
> there's much more service information in DC-to-DC traffic compared to
> client-to-server traffic, although I am by no means certain here.
>
> Overall I have a couple of questions:
> - Is it indeed the case that server-to-server replication traffic can be
> significantly more bloated than client-to-server traffic? Or do I need
> to review my testing methodology?
> - Is there anything that can be done to reduce cross-DC replication
> traffic? Perhaps some compression scheme? Or some delay before
> replication, allowing for possibly more increments to be merged together?
>
> Best regards,
> Sergey
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-counters-replication-uses-more-traffic-than-client-increments-tp7584412p7584620.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive
> at Nabble.com.
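As a follow-up on the compression question: if I recall the stock 1.2
cassandra.yaml correctly, the CASSANDRA-3127 change Aaron links is
controlled by a single setting, so the scheme Sergey asks about looks
roughly like:

    # cassandra.yaml (1.2); 'all' compresses all inter-node traffic,
    # 'dc' compresses only cross-datacenter traffic, 'none' disables it
    internode_compression: dc

With 'dc', intra-DC messages stay uncompressed while the cross-DC
replication stream measured above is compressed.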