cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alain RODRIGUEZ <arodr...@gmail.com>
Subject Re: Effect of adding/removing nodes from cassandra
Date Wed, 16 Sep 2015 08:58:55 GMT
Hi again Aadil,

I just had a quick look, as I don't really know the C* code, you can check
that as well as I do (maybe even better), but I found this, it might be a
good place to start having a look -->
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/streaming/ConnectionHandler.java

Yet I am not a Cassandra commiter. maybe would you have more luck asking
there --> dev@cassandra.apache.org (to subscribe send a message to
dev-subscribe@cassandra.apache.org)

Plus it might be interesting for people there to see that bootstrap (and
streaming stuff) uses that much CPU.

My 2 cents,

C*heers,

Alain


2015-09-15 0:55 GMT+02:00 Ahamed, Aadil <aahamed@akamai.com>:

> Hi,
> I just wanted to resubmit this post in case it didn’t go through.
>
> I used the sjk jvm-tool: https://github.com/aragozin/jvm-tools
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_aragozin_jvm-2Dtools&d=BQMF-g&c=96ZbZZcaMF4w0F4jpN6LZg&r=IPrliSk1Ijg-U4peH4IIW6Lcj4i6GMuxpxu3zDH-l0w&m=bZUdJplcHM5O0utN6k29Q7-osMmhK8tYqLUvl_xhTFg&s=rzUSlUgnPe1HeHl5iM4Uktrz2KExlXdNyNMmgcnQlus&e=>
to
> profile a Cassandra node being added to the cluster.
> I ran the profiler using this command: java –jar sjk.jar ttop –s
> localhost:7199 –o CPU –n 25
> Here is a snapshot of the profiler when CPU Usage was highest:
>
>
> As Alain hypothesized in a previous email handling the threads responsible
> for handling the streams take up quite a bit of CPU. Is there any way I can
> use this thread name STREAM-IN-/198.18.71.134 to figure out where it is
> implemented in the source code or why it is using so much CPU?
>
> Thanks for your patience,
> Aadil
>
> From: Aadil Ahamed
> Reply-To: "user@cassandra.apache.org"
> Date: Friday, September 11, 2015 at 11:28 AM
>
> To: "user@cassandra.apache.org"
> Subject: Re: Effect of adding/removing nodes from cassandra
>
> Hi,
>
> I used the sjk jvm-tool: https://github.com/aragozin/jvm-tools
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_aragozin_jvm-2Dtools&d=BQMF-g&c=96ZbZZcaMF4w0F4jpN6LZg&r=IPrliSk1Ijg-U4peH4IIW6Lcj4i6GMuxpxu3zDH-l0w&m=bZUdJplcHM5O0utN6k29Q7-osMmhK8tYqLUvl_xhTFg&s=rzUSlUgnPe1HeHl5iM4Uktrz2KExlXdNyNMmgcnQlus&e=>
to
> profile a Cassandra node being added to the cluster.
> I ran the profiler using this command: java –jar sjk.jar ttop –s
> localhost:7199 –o CPU –n 25
> Here is a snapshot of the profiler when CPU Usage was highest:
>
>
> As Alain hypothesized in a previous email handling the threads responsible
> for handling the streams take up quite a bit of CPU. Is there any way I can
> use this thread name STREAM-IN-/198.18.71.134 to figure out where it is
> implemented in the source code or why it is using so much CPU?
>
> Thanks,
> Aadil
>
>
> From: Aadil Ahamed
> Reply-To: "user@cassandra.apache.org"
> Date: Tuesday, September 8, 2015 at 10:41 AM
> To: "user@cassandra.apache.org"
> Subject: Re: Effect of adding/removing nodes from cassandra
>
> Hi,
>
> I am actually very interested in taking a look at the source code. Are you
> aware of which file(s) are responsible for the bootstrap process?
>
> As for how I am measuring CPU & Network I/O:
> I am using an open source python package called psutil
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__pythonhosted.org_psutil_&d=BQMF-g&c=96ZbZZcaMF4w0F4jpN6LZg&r=IPrliSk1Ijg-U4peH4IIW6Lcj4i6GMuxpxu3zDH-l0w&m=bZUdJplcHM5O0utN6k29Q7-osMmhK8tYqLUvl_xhTFg&s=rZEjxqj6GvMRb_vTnClwVDC4bHb1CecRqRhKsBIchR0&e=>.
> This package contains a library of functions for querying system related
> information like cpu and network I/O.
>
> Specifically I use these functions:
>  psutil.cpu_percent() - get average cpu utilization across all cores as a
> percentage
>  psutil.net_io_counters().bytes_sent – get number of bytes sent over the
> network
>  psutil.net_io_counters().bytes_recv  – get number of bytes received over
> the network
>
> Also, thanks for linking the jvm tool
> https://github.com/aragozin/jvm-tools
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_aragozin_jvm-2Dtools&d=BQMF-g&c=96ZbZZcaMF4w0F4jpN6LZg&r=IPrliSk1Ijg-U4peH4IIW6Lcj4i6GMuxpxu3zDH-l0w&m=bZUdJplcHM5O0utN6k29Q7-osMmhK8tYqLUvl_xhTFg&s=rzUSlUgnPe1HeHl5iM4Uktrz2KExlXdNyNMmgcnQlus&e=>.
> It seems like it could be really helpful. I will run it and let you know of
> the results.
>
> Thanks,
> Aadil
>
> From: Edouard COLE <Edouard.COLE@rgsystem.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Monday, September 7, 2015 at 8:54 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: RE: Effect of adding/removing nodes from cassandra
>
> Hi,
>
> Could you show us how you measure CPU & network?
>
> Just to be sure we're not missing something obvious
>
> Thanks,
>
>
>
> De : Alain RODRIGUEZ [mailto:arodrime@gmail.com <arodrime@gmail.com>]
> Envoyé : Monday, September 07, 2015 5:19 PM
> À : user@cassandra.apache.org
> Objet : Re: Effect of adding/removing nodes from cassandra
>
> Hi, sorry I have been quite busy and was hopping that someone else could
> answer about CPU, as told I never went that deep.
>
> "But if there were in fact 5 nodes then the max throughput per node would
> be capped at 25/5 = 5MB/s"
> This is wrong imho. Stream are globally capped, independently of the total
> nimber of node, at any time. I mean, output of each node is capped to 200
> Mb, no matter you node number, so you can always reach this 200 Mb
> theoretically, even if streaming to one node.
> Yet I know that bandwidth is not often the bottleneck lately, something
> else is, not sure what, you might want to "watch" this issue
> https://issues.apache.org/jira/browse/CASSANDRA-9766
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_CASSANDRA-2D9766&d=BQMF-g&c=96ZbZZcaMF4w0F4jpN6LZg&r=IPrliSk1Ijg-U4peH4IIW6Lcj4i6GMuxpxu3zDH-l0w&m=bZUdJplcHM5O0utN6k29Q7-osMmhK8tYqLUvl_xhTFg&s=yadwfhjjbxHQWUkBUccKWqJ-kyHQuKAcsLo_n25bf2Y&e=>
> .
>
> About cpu, if this is not due to Compactions then I have no ideas of what
> is happening exactly. Yet, you might want to use
> https://github.com/aragozin/jvm-tools
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_aragozin_jvm-2Dtools&d=BQMF-g&c=96ZbZZcaMF4w0F4jpN6LZg&r=IPrliSk1Ijg-U4peH4IIW6Lcj4i6GMuxpxu3zDH-l0w&m=bZUdJplcHM5O0utN6k29Q7-osMmhK8tYqLUvl_xhTFg&s=rzUSlUgnPe1HeHl5iM4Uktrz2KExlXdNyNMmgcnQlus&e=>
> to find this out. Running something like "java -jar sjk-plus-0.3.6.jar ttop
> -s localhost:7199 -o CPU -n 30" would probably help.
>
> About “handling all the streams” it was to say something. I have no idea
> what happen in there but there a little work to handle streams and write
> data... Yet this should not be 20 %, unless your CPU are
> ridiculously inefficient / old imho.
>
> I have no knowledge of a documentation describing the bootstrap process. I
> use Datastax docs a lot (
> http://docs.datastax.com/en/cassandra/2.1/cassandra/gettingStartedCassandraIntro.html
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__docs.datastax.com_en_cassandra_2.1_cassandra_gettingStartedCassandraIntro.html&d=BQMF-g&c=96ZbZZcaMF4w0F4jpN6LZg&r=IPrliSk1Ijg-U4peH4IIW6Lcj4i6GMuxpxu3zDH-l0w&m=bZUdJplcHM5O0utN6k29Q7-osMmhK8tYqLUvl_xhTFg&s=X78V1Bp5RTzK7I8ZR2aJMwstzlP_DlHPSxRCt2XKAIU&e=>)
> but I don't think they go that deep, it is more a user guide than a
> technical description. You might want to have a look at the code...
>
> Please keep us posted, your work is interesting (plus linked to some
> recent issues about over throttled bootstraps).
>
> C*heers,
>
> Alain
>
> 2015-09-02 22:33 GMT+02:00 Ahamed, Aadil <aahamed@akamai.com>:
> Hi Alain,
>
> Thank you for your prompt and detailed response. I found it very helpful.
> As for your questions:- I am interning at Akamai. I can’t say too much
> about the use case but we are looking at Cassandra as an option to serve
> large volumes of sustained data.
>
> My Cassandra version is:  2.1.7 (thanks for the tip)
>
> As a follow up to your response:
>
> Stream Throughput
> I ran nodetool getstreamthroughput and the output was 200Mb/s = 25 MB/s so
> it is weird that my throughput is capped at 5MB/s. However, I do have a
> guess for why this is occurring – Initially I had a 5 node cluster but I
> decommissioned 2 nodes for the purpose of this test. But if there were in
> fact 5 nodes then the max throughput per node would be capped at 25/5 =
> 5MB/s. Based on the issue you raised it does not look like Cassandra can
> adapt the per node streaming throughput (5MB/s) to take full advanatage of
> the total stream throughput (25 MB/s). This still raises the question: Why
> isn’t the per node stream throughput changed to 25/3 = 8.33MB/s after 2
> nodes are decommissioned from the 5 node cluster? Does it have something to
> do with  Cassandra expecting decommissioned nodes to rejoin the cluster
> eventually? Of course, my guess could be wrong and the problem could be a
> result of a completely different issue.
>
> CPU usage for Bootstrapping
> I have tried adding nodes with auto-compaction disabled (nodetool
> disableautocompaction) and it still results in similar amounts of cpu usage
> so I don’t think compaction is the main cause. In addition, you said that
>  handling all the stream also consume CPU. Could you explain a little more
> on what you mean by “handling all the streams”?. Also, if you know any
> sites/documents that have more in depth information about the bootstrap
> process please let me know.
>
> Thanks again for your help. I really appreciate it.
> Aadil
>
>
>
> From: Alain RODRIGUEZ <arodrime@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Tuesday, September 1, 2015 at 5:26 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Effect of adding/removing nodes from cassandra
>
> Hi Aadil, and welcome then !
> Graph for initial node 1, 2:
> 1. You can set "stream_throughput_outbound_megabits_per_sec" in
> cassandra.yaml configuration for permanent change. The point here is not to
> take all the bandwidth while adding a node to let transactional network to
> run as normally as possible. Also look at "nodetool setstreamthroughput"
> (and "getstreamthroughput") if you want to adjust during an operation
> without restarting the node. This will be overridden on node restart, back
> to the value  in the cassandra.yaml
> This limit the outgoing traffic, meaning that this will increase for each
> node you add, if using vnodes, on certain operations like repair,
> removenode, ... You might want to take this into consideration. I created a
> not very popular issue about this, at least I detail the issue there -->
> https://issues.apache.org/jira/browse/CASSANDRA-9509
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_CASSANDRA-2D9509&d=BQMF-g&c=96ZbZZcaMF4w0F4jpN6LZg&r=IPrliSk1Ijg-U4peH4IIW6Lcj4i6GMuxpxu3zDH-l0w&m=bZUdJplcHM5O0utN6k29Q7-osMmhK8tYqLUvl_xhTFg&s=C7uRF8x-FbxnIB46zhjNo0YXGg9liM2sDKS6M8U5GS4&e=>.
> In your exemple see that you receive 2 * 5 = 10 MB/s, then one node
> finishes, and you keep receiving 5 MB/s.
> Though, default is 200 Mbps (25 MB) , not sure why you are limited to 5 MB
> there, unless you changed this or I miscalculated something, this use to
> happen to me :D... Also maybe something relative to
> https://issues.apache.org/jira/browse/CASSANDRA-9766
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_CASSANDRA-2D9766&d=BQMF-g&c=96ZbZZcaMF4w0F4jpN6LZg&r=IPrliSk1Ijg-U4peH4IIW6Lcj4i6GMuxpxu3zDH-l0w&m=bZUdJplcHM5O0utN6k29Q7-osMmhK8tYqLUvl_xhTFg&s=yadwfhjjbxHQWUkBUccKWqJ-kyHQuKAcsLo_n25bf2Y&e=>
> (btw, ou might want to add your version of Cassandra when posting, it would
> be easier to point you to known bugs or to the right options etc)
> Graph for added node:
> 1. Adding a node is CPU intensive I guess it is mainly because you have to
> compact all the data you accumulated quite fast on the bootstrapping node
> (maybe there is more reason, other people might explain this better, I
> never needed to go that deep, I imagine handling all the stream also
> consume CPU)
> 2. You partially answered yourself, one node ended streaming. The other
> half of the question, about the reason why the other node don't send more
> data is explained above. Limitations are on outgoing traffic, so this looks
> normal to me.
> The only weird thing to me is the threshold of you streaming throughput,
> unless you changed it.
> Btw, I am very curious, please feel free not to answer this if you are
> under a NDA or whatever. Are you working for Akamai ? What's your use case
> for Cassandra ?
> Hope this will help !
> C*heers,
> Alain
>
> Hi,
>
> This is my first post to this mailing list so I want to apologize in
> advance if I break any rules or guidelines. I have also inserted images of
> graphs and I am not sure if they will show up. Please let me know if I can
> improve my post in any way.
>
> I am investigating the effect of adding and removing nodes from a
> Cassandra cluster by collecting metrics on cpu utilization, memory usage
> etc.
> Basically I want to have a quantitative measurement of how intensive the
> add/remove operation is so we know what to expect when
> increasing/decreasing capacity on our production cluster.
> I wrote a tool to measure this effect but I need help interpreting the
> data I have collected.
>
> Here is a sample testcase:
>
> Relevant Machine Information:
> Logical cores: 8
> Core model: Intel Xeon(R) CPU E31270 @ 3.4GHz
> RAM: 16GB
> Non-Volatile storage: Hard Disk
>
> Test Info:
> Keyspace: Simple strategy with replication factor 2
> Initial no. of nodes on cluster: 2 [198.18.71.132, 198.18.71.133]
> Initial load on each node: 30GB
> No. of added nodes: 1 [198.18.71.134]
> Final load on each node: 18GB
>
> *The x-axis for all graphs is in seconds
>
> Graph for initial node 1:
>
>
>
> Graph for initial node 2:
>
>
>
> Graph for added node:
>
>
>
> Here are my questions:
>
> Graph for initial node 1, 2:
>
> 1. Why is the KBytes/s sent over the network stay constant at around ~5000
> KB/s? Can I change it to something higher?
>
> Graph for added node:
>
> 1.  Why is the cpu usage so high?
> It is close to 30% for ~800 seconds and 10% for another 400 seconds. I did
> not expect the cpu usage to be this high and I did not expect it to stay
> high for such a long period of time. Based on my understanding the only cpu
> intensive process during bootstrap is token range recalculation. However,
> based on the graph the cpu usage seems to be proportional to the amount of
> data that is streamed to the node at any given moment in time.
>
> 2. Why does the Kbytes/s received over the network drop from - ~10000 ->
> ~5000  - at around 800 seconds?
> I can see that initial node 1 stops streaming data at around 800 seconds
> but why does initial node 2 not bump up its outgoing rate of transfer?
>
>
> Thanks for your help,
> Aadil
>
>
>
>

Mime
View raw message