cassandra-commits mailing list archives

From "Arya Goudarzi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-5432) Repair Freeze/Gossip Invisibility Issues 1.2.4
Date Sat, 20 Apr 2013 00:09:16 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-5432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637036#comment-13637036 ]

Arya Goudarzi commented on CASSANDRA-5432:
------------------------------------------

Hey Vijay,

Good to see you here. Sorry if my analysis is unclear. Here is my take:

> The first time we start communication to a node, we try to initiate it using the public IP, and eventually, once we have the private IP, we will switch back to local IPs.

Has this always been the case? Because if you are using public IPs (not public DNS names), there have to be explicit security rules on the public IPs to allow this. Otherwise, if in your security groups you open the ports to the machines in the same group using their security group name, that only allows traffic over their private IPs, so this won't work.
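
To make that distinction concrete, here is roughly how the two kinds of rules look with the AWS CLI (the group name "cassandra", the default storage port 7000, and the CIDR range are placeholders for illustration only): a rule keyed on a source security group only matches traffic arriving over private IPs, while reaching a public IP needs an explicit CIDR rule.

    # source-group rule: members of the group can reach port 7000,
    # but only when they connect over their private IPs
    aws ec2 authorize-security-group-ingress --group-name cassandra \
        --protocol tcp --port 7000 --source-group cassandra

    # accepting the same port on the public IPs needs an explicit
    # CIDR rule on top of that (placeholder range shown)
    aws ec2 authorize-security-group-ingress --group-name cassandra \
        --protocol tcp --port 7000 --cidr 203.0.113.0/24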

We use Priam (your awesome tooling), and as you know, it opens up only the SSL port on the public IPs for cross-region communication. From the operator's perspective, that is the correct thing to do. I only have the SSL port open on the public IPs and don't want to open the non-SSL port for security reasons. All other ports (non-SSL storage, JMX, etc.) are opened the way I described, using security group names, which allows traffic only over the private IPs. That is just how AWS works. So, within the same region, if you try to connect to any machine using its public IP, it won't work.
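
For reference, this is roughly the relevant piece of our cassandra.yaml (defaults and placeholders shown; keystore paths and passwords elided). The point is that internode_encryption is set to dc, so only cross-datacenter traffic uses the SSL port:

    endpoint_snitch: Ec2MultiRegionSnitch
    storage_port: 7000        # non-SSL internode port, reachable only on private IPs in our groups
    ssl_storage_port: 7001    # SSL internode port, the only one open on the public IPs
    server_encryption_options:
        internode_encryption: dc    # encrypt (and use the SSL port) only between datacenters/regions
        keystore: ...
        truststore: ...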

Here is how I reproduced the scenario above, and I believe it is all related to your statement that every machine connects to the public IP first.

Set up a cluster as I described in my previous comment (it can be a single region) and restart all machines at the same time. Each machine only sees itself as UP; everyone else is reported as DOWN in nodetool ring. My guess is that this is because they are trying to send gossip to the public IPs, but only the SSL port is open on the public IPs, and the cluster is configured to use SSL only across datacenters/regions, not within the same region. So now I am left with a bunch of nodes that only see themselves in the ring. I go to my AWS console and open up the non-SSL port to every public IP in that security group, and now all the nodes see each other.
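
A condensed version of that reproduction, with placeholder addresses, looks something like this:

    # after restarting every node at the same time, each node reports
    # only itself as Up
    nodetool -h 10.0.0.1 ring

    # open the non-SSL storage port (7000 by default) to the nodes'
    # public IPs with an extra CIDR rule (placeholder range)
    aws ec2 authorize-security-group-ingress --group-name cassandra \
        --protocol tcp --port 7000 --cidr 203.0.113.0/24

    # gossip recovers and every node sees the others as Up again
    nodetool -h 10.0.0.1 ring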


By now I had a theory that the nodes want to communicate through the public IP, which is not possible, so I moved on to troubleshooting repair. I knew that with the current (temporarily opened) rules repair would succeed. Since the nodes now see each other, I went back to the security groups and removed the non-SSL-on-public-IP rules I had added in the previous step, started the repair, and ended up with the log message above. The public IP mentioned in the log belongs to the node that owns the log and is running the repair, so it tried to communicate with itself using its own public IP.
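
A quick way to see the same thing from the node itself (addresses are placeholders): the non-SSL port answers on the private interface but not on the node's own public IP, which is exactly the address the repair ends up trying.

    nc -vz 10.0.0.1 7000        # private IP: connection succeeds
    nc -vz 203.0.113.10 7000    # own public IP: hangs/fails unless the extra rule is in place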

Did that make sense? I can call you and describe it over the phone, but basically this setup used to work on 1.1.10 and does not work on 1.2.4. I have attached a debugger to a node and am trying to trace the code. I'll let you know if I find something new.
                
> Repair Freeze/Gossip Invisibility Issues 1.2.4
> ----------------------------------------------
>
>                 Key: CASSANDRA-5432
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5432
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.2.4
>         Environment: Ubuntu 10.04.1 LTS
> C* 1.2.3
> Sun Java 6 u43
> JNA Enabled
> Not using VNodes
>            Reporter: Arya Goudarzi
>            Assignee: Vijay
>            Priority: Critical
>
> Read comment 6. This description summarizes the repair issue only, but I believe there is a bigger problem going on with networking, as described in that comment.
> Since upgrading our sandbox cluster, I am unable to run repair on any node, and I am reaching our gc_grace seconds this weekend. Please help. So far, I have tried the following suggestions:
> - nodetool scrub
> - offline scrub
> - running repair on each CF separately. Didn't matter. All got stuck the same way.
> The repair command just gets stuck and the machine is idling. Only the following logs are printed for the repair job:
>  INFO [Thread-42214] 2013-04-05 23:30:27,785 StorageService.java (line 2379) Starting
repair command #4, repairing 1 ranges for keyspace cardspring_production
>  INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,789 AntiEntropyService.java (line 652)
[repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] new session: will sync /X.X.X.190, /X.X.X.43,
/X.X.X.56 on range (1808575600,42535295865117307932921825930779602032] for keyspace_production.[comma
separated list of CFs]
>  INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,790 AntiEntropyService.java (line 858)
[repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] requesting merkle trees for BusinessConnectionIndicesEntries
(to [/X.X.X.43, /X.X.X.56, /X.X.X.190])
>  INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,086 AntiEntropyService.java (line 214)
[repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] Received merkle tree for ColumnFamilyName from
/X.X.X.43
>  INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,147 AntiEntropyService.java (line 214)
[repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242] Received merkle tree for ColumnFamilyName from
/X.X.X.56
> Please advise. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
