Hello Kevin, here are a few thoughts and things you could try:
each node has stored about 4TB
When I joined a new node, I found that the process has not been completed for more than a week
So first thought is that this is a lot of data per node. I think the sweet spot is around 1-2 TB / node. That being said, since 8 TB are available, this should not be an issue as long as you are ready to wait days to have data moving around (repairs / bootstrap / etc). Also, modern versions of Apache Cassandra aim at increasing the amount of data per node allowed by reducing operation time and managing memory (among other things) better and better. Also, I have worked with perfectly healthy nodes with 4 TB of data. So it's nothing completely undoable.
By default in your version you should have 'streaming_socket_timeout_in_ms: 86400000', if it is set to zero for some reason, set it back to 1 day (86400000 ms). If it is already set to one day and you have some 'VERY LARGE'™ (like 1+ TB) SSTables you might want to try setting this value to 2 days or so. You should be able to find trace of streaming failure and retry in the logs if that's the issue you are facing.
CPU load of new node and some other nodes continued to be high
That's normal for new nodes, they are receiving tons of data that needs to be compacted. Yet as long as they are not showing 'UN' in 'nodetool status' they are not serving read and so not creating any latency. No worries about that. On old nodes, it's interesting to dig what CPUs are actually doing.
finally had to give up join
So you stopped the joining node, right? Next time 'nodetool netstats -H' could give you interesting progress information on stream. You could also see if a stream get blocked using ‘watch -d nodetool netstats -H'.
Is there any good way to speed up the process of adding new nodes?
It mostly depends on the load you that the existing nodes (CPU / disk) and network can handle. But you can tune the streaming speed with 'stream_throughput_outbound_megabits_per_sec: 200' (200 default value) and the 'inter_dc_stream_throughput_outbound_megabits_per_sec' if streaming cross datacenter in the cassandra.yaml or even dynamically with 'nodetool setstreamthroughput X'
Also, new nodes can probably compact faster as they are not used for serving reads: 'compaction_throughput_mb_per_sec'. Again, with nodetool (and so without restarting the node): 'nodetool setcompactionthroughput X'
Next time you add a node, give us some outputs with commands above and try some distinct tuning. Let us know how it goes before it's too late :-).
Also in recent versions you can 'resume' a bootstrap I believe. That my be handy when using 4 TB big Apache Cassandra nodes.
The Last Pickle - Apache Cassandra Consulting