cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benjamin Roth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12905) Retry acquire MV lock on failure instead of throwing WTE on streaming
Date Thu, 15 Dec 2016 09:59:58 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750958#comment-15750958
] 

Benjamin Roth commented on CASSANDRA-12905:
-------------------------------------------

+1 for the dtest!

For my understanding of hint retransmission:
If we have a hint file that is being processed and a WTE occurs in the middle, will the whole
file be retransmitted or can it be resumed at the last successful position? 

I guess this is not the case from my personal observations. I had situations with > 1GB
hint queues per sender node which were not going to shrink due to WTEs. It seemed like the
same hints have been retransmitted over and over again from scratch instead of trying to resume.
What helped in this situation was to pause hint delivery on all nodes but 1-2. I'm pretty
sure the problem was WTEs due to lock contentions. To be honest, I did not try lowering hinted_handoff_throttle_in_kb
or at least I don't remember.

> Retry acquire MV lock on failure instead of throwing WTE on streaming
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-12905
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12905
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>         Environment: centos 6.7 x86_64
>            Reporter: Nir Zilka
>            Assignee: Benjamin Roth
>            Priority: Critical
>             Fix For: 3.10
>
>
> Hello,
> I performed two upgrades to the current cluster (currently 15 nodes, 1 DC, private VLAN),
> first it was 2.2.5.1 and repair worked flawlessly,
> second upgrade was to 3.0.9 (with upgradesstables) and also repair worked well,
> then i upgraded 2 weeks ago to 3.9 - and the repair problems started.
> there are several errors types from the system.log (different nodes) :
> - Sync failed between /xxx.xxx.xxx.xxx and /xxx.xxx.xxx.xxx
> - Streaming error occurred on session with peer xxx.xxx.xxx.xxx Operation timed out -
received only 0 responses
> - Remote peer xxx.xxx.xxx.xxx failed stream session
> - Session completed with the following error
> org.apache.cassandra.streaming.StreamException: Stream failed
> ----
> i use 3.9 default configuration with the cluster settings adjustments (3 seeds, GossipingPropertyFileSnitch).
> streaming_socket_timeout_in_ms is the default (86400000).
> i'm afraid from consistency problems while i'm not performing repair.
> Any ideas?
> Thanks,
> Nir.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message