incubator-cassandra-user mailing list archives

From Steven A Robenalt <srobe...@stanford.edu>
Subject Re: Cassandra 2.0.2 - Frequent Read timeouts and delays in replication on 3-node cluster in AWS VPC
Date Wed, 20 Nov 2013 00:55:38 GMT
It seems that with NTP properly configured, the replication is now working
as expected, but there are still a lot of read timeouts. The
troubleshooting continues...
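
For anyone comparing clocks across nodes while chasing a similar problem, here is a minimal sketch of a clock-offset check. It is not something from this thread: the function names are made up for illustration, and it assumes outbound UDP port 123 is allowed by the VPC ACLs and security group. It sends one SNTP request and applies the usual offset formula, ((t2 - t1) + (t3 - t4)) / 2, to the reply.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_DELTA = 2208988800


def ntp_to_unix(seconds, fraction):
    """Convert a 64-bit NTP timestamp (seconds, fraction) to Unix seconds."""
    return seconds - NTP_EPOCH_DELTA + fraction / 2**32


def clock_offset(packet, t1, t4):
    """Estimate the offset of the server clock relative to the local clock.

    packet is the 48-byte SNTP reply; t1 and t4 are the local send and
    receive times in Unix seconds. A positive result means the local
    clock is behind the server.
    """
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    fields = struct.unpack("!12I", packet[:48])
    t2 = ntp_to_unix(fields[8], fields[9])    # server receive timestamp
    t3 = ntp_to_unix(fields[10], fields[11])  # server transmit timestamp
    return ((t2 - t1) + (t3 - t4)) / 2


def query_offset(server, timeout=2.0):
    """Send one SNTP client request (hypothetical helper) and return the offset."""
    request = b"\x23" + 47 * b"\0"  # LI=0, VN=4, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t1 = time.time()
        sock.sendto(request, (server, 123))
        packet, _ = sock.recvfrom(512)
        t4 = time.time()
    return clock_offset(packet, t1, t4)
```

Calling query_offset() with the same NTP server on each node (pool.ntp.org is only an example) and comparing the results gives a rough idea of how far apart the clocks are. A socket timeout here would be consistent with NTP traffic being blocked.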


On Tue, Nov 19, 2013 at 8:53 AM, Steven A Robenalt <srobenal@stanford.edu> wrote:

> Thanks Michael, I will try that out.
>
>
> On Tue, Nov 19, 2013 at 5:28 AM, Laing, Michael <michael.laing@nytimes.com> wrote:
>
>> We had a similar problem when our nodes could not sync using NTP due to
>> VPC ACL settings. -ml
>>
>>
>> On Mon, Nov 18, 2013 at 8:49 PM, Steven A Robenalt <srobenal@stanford.edu> wrote:
>>
>>> Hi all,
>>>
>>> I am attempting to bring up our new app on a 3-node cluster and am
>>> having problems with frequent read timeouts and slow inter-node
>>> replication. Initially, these errors were mostly occurring in our app
>>> server, affecting 0.02%-1.0% of our queries in an otherwise unloaded
>>> cluster. No exceptions were logged on the servers in this case, and reads
>>> in a single node environment with the same code and client driver virtually
>>> never see exceptions like this, so I suspect problems with the
>>> communication between nodes in the cluster.
>>>
>>> The 3 nodes are deployed in a single AWS VPC, and are all in a common
>>> subnet. The Cassandra version is 2.0.2 following an upgrade this past
>>> weekend due to NPEs in a secondary index that were affecting certain
>>> queries under 2.0.1. The servers are m1.large instances running AWS Linux
>>> and Oracle JDK7u40. The first 2 nodes in the cluster are the seed nodes.
>>> All database contents are CQL tables with replication factor of 3, and the
>>> application is Java-based, using the latest Datastax 2.0.0-rc1 Java Driver.
>>>
>>> In testing with the application, I noticed this afternoon that the
>>> contents of the 3 nodes differed in their respective copies of the same
>>> table for newly written data, for time periods exceeding several minutes,
>>> as reported by cqlsh on each node. Specifying different hosts from the same
>>> server using cqlsh also exhibited timeouts on multiple attempts to connect,
>>> and on executing some queries, though they eventually succeeded in all
>>> cases, and eventually the data in all nodes was fully replicated.
>>>
>>> The AWS servers have a security group with only ports 22, 7000, 9042,
>>> and 9160 open.
>>>
>>> At this time, it seems that either I am still missing something in my
>>> cluster configuration, or maybe there are other ports that are needed for
>>> inter-node communication.
>>>
>>> Any advice/suggestions would be appreciated.
>>>
>>>
>>>
>>> --
>>> Steve Robenalt
>>> Software Architect
>>> HighWire | Stanford University
>>> 425 Broadway St, Redwood City, CA 94063
>>>
>>> srobenal@stanford.edu
>>> http://highwire.stanford.edu
>>
>
>
> --
> Steve Robenalt
> Software Architect
> HighWire | Stanford University
> 425 Broadway St, Redwood City, CA 94063
>
> srobenal@stanford.edu
> http://highwire.stanford.edu


-- 
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063

srobenal@stanford.edu
http://highwire.stanford.edu
