cassandra-user mailing list archives

From Shalom Sagges <shal...@liveperson.com>
Subject Re: A Single Dropped Node Fails Entire Read Queries
Date Sun, 12 Mar 2017 08:21:54 GMT
Hi Michael,

If a node suddenly fails, and there are other replicas that can still
satisfy the consistency level, shouldn't the request succeed regardless of
the failed node?
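
To spell out the arithmetic behind this question, here is a minimal sketch. It assumes the usual quorum rule of floor(RF / 2) + 1 applied to the replication factor of the local DC; the function names are made up for illustration and are not part of Cassandra or any driver.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def can_satisfy_local_quorum(replication_factor: int, alive_replicas: int) -> bool:
    """True if enough replicas in the local DC are alive to reach LOCAL_QUORUM."""
    return alive_replicas >= quorum(replication_factor)


# RF = 3 in the local DC, as in the keyspace quoted below.
required = quorum(3)
one_node_down = can_satisfy_local_quorum(3, alive_replicas=2)
two_nodes_down = can_satisfy_local_quorum(3, alive_replicas=1)
```

With RF = 3 and one node down, two replicas are alive and two are required, so the read should be satisfiable. The Unavailable error quoted below reports `required_replicas: 2` and `alive_replicas: 1`, which corresponds to the second case.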

Thanks!





Shalom Sagges
DBA
T: +972-74-700-4035


On Fri, Mar 10, 2017 at 6:25 PM, Michael Shuler <michael@pbandjelly.org>
wrote:

> I may be mistaken on the exact configuration option for the timeout
> you're hitting, but I believe this may be the general
> `request_timeout_in_ms: 10000` in conf/cassandra.yaml.
>
> A reasonable timeout for a "node down" discovery/processing is needed to
> prevent random flapping of nodes with a super short timeout interval.
> Applications should also retry on a host unavailable exception like
> this, because in the long run, this should be expected from time to time
> for network partitions, node failure, maintenance cycles, etc.
>
> --
> Kind regards,
> Michael
>
> On 03/10/2017 04:07 AM, Shalom Sagges wrote:
> > Hi Daniel,
> >
> > I don't think that's a network issue, because ~10 seconds after the node
> > stopped, the queries were successful again without any timeout issues.
> >
> > Thanks!
> >
> > On Fri, Mar 10, 2017 at 12:01 PM, Daniel Hölbling-Inzko
> > <daniel.hoelbling-inzko@bitmovin.com
> > <mailto:daniel.hoelbling-inzko@bitmovin.com>> wrote:
> >
> >     Could there be network issues in connecting between the nodes? If
> >     node A becomes the query coordinator but can't reach B, and C is
> >     obviously down, it won't get a quorum.
> >
> >     Greetings
> >
> >     Shalom Sagges <shaloms@liveperson.com
> >     <mailto:shaloms@liveperson.com>> wrote on Fri, Mar 10, 2017 at 10:55:
> >
> >         @Ryan, my keyspace replication settings are as follows:
> >         CREATE KEYSPACE mykeyspace WITH replication = {'class':
> >         'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3', 'DC3': '3'}
> >          AND durable_writes = true;
> >
> >         CREATE TABLE mykeyspace.test (
> >             column1 text,
> >             column2 text,
> >             column3 text,
> >             PRIMARY KEY (column1, column2)
> >         );
> >
> >         The query is: select * from mykeyspace.test where
> >         column1='xxxxx';
> >
> >         @Daniel, the replication factor is 3. That's why I don't
> >         understand why I get these timeouts when only one node drops.
> >
> >         Also, when I enabled tracing, I got the following error:
> >         Unable to fetch query trace: ('Unable to complete the operation
> >         against any hosts', {<Host: 127.0.0.1 DC1>: Unavailable('Error
> >         from server: code=1000 [Unavailable exception] message="Cannot
> >         achieve consistency level LOCAL_QUORUM"
> >         info={\'required_replicas\': 2, \'alive_replicas\': 1,
> >         \'consistency\': \'LOCAL_QUORUM\'}',)})
> >
> >         But nodetool status shows that only 1 replica was down:
> >         --  Address    Load       Tokens  Owns  Host ID                               Rack
> >         DN  x.x.x.235  134.32 MB  256     ?     c0920d11-08da-4f18-a7f3-dbfb8c155b19  RAC1
> >         UN  x.x.x.236  134.02 MB  256     ?     2cc0a27b-b1e4-461f-a3d2-186d3d82ff3d  RAC1
> >         UN  x.x.x.237  134.34 MB  256     ?     5b2162aa-8803-4b54-88a9-ff2e70b3d830  RAC1
> >
> >
> >         I tried to run the same scenario on all 3 nodes, and only the
> >         3rd node didn't fail the query when I dropped it. The nodes were
> >         installed and configured with Puppet so the configuration is the
> >         same on all 3 nodes.
> >
> >
> >         Thanks!
> >
> >
> >
> >         On Fri, Mar 10, 2017 at 10:25 AM, Daniel Hölbling-Inzko
> >         <daniel.hoelbling-inzko@bitmovin.com
> >         <mailto:daniel.hoelbling-inzko@bitmovin.com>> wrote:
> >
> >             LOCAL_QUORUM is computed from the replication factor in the
> >             local DC, not from the number of nodes. So if your replication
> >             factor is 2, the quorum is also 2, and even with 10 nodes you
> >             cannot lose a single replica of the data being read. With a
> >             replication factor of 3, the quorum is 2, so you can lose one
> >             replica and still satisfy the query.
> >
> >             Ryan Svihla <rs@foundev.pro <mailto:rs@foundev.pro>> wrote
> >             on Thu, Mar 9, 2017 at 18:09:
> >
> >                 What are your keyspace replication settings, and what's
> >                 your query?
> >
> >                 On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges
> >                 <shaloms@liveperson.com <mailto:shaloms@liveperson.com>>
> >                 wrote:
> >
> >                     Hi Cassandra Users,
> >
> >                     I hope someone could help me understand the
> >                     following scenario:
> >
> >                     Version: 3.0.9
> >                     3 nodes per DC
> >                     3 DCs in the cluster.
> >                     Consistency Local_Quorum.
> >
> >                     I did a small resiliency test and dropped a node to
> >                     check the availability of the data.
> >                     What I assumed would happen is nothing at all. If a
> >                     node is down in a 3-node DC, Local_Quorum should
> >                     still be satisfied.
> >                     However, during the first ~10 seconds after stopping
> >                     the service, I got timeout errors (I tried it both
> >                     from the client and from cqlsh).
> >
> >                     This is the error I get:
> >                     ServerError:
> >                     com.google.common.util.concurrent.UncheckedExecutionException:
> >                     com.google.common.util.concurrent.UncheckedExecutionException:
> >                     java.lang.RuntimeException:
> >                     org.apache.cassandra.exceptions.ReadTimeoutException:
> >                     Operation timed out - received only 4 responses.
> >
> >
> >                     After ~10 seconds, the same query is successful with
> >                     no timeout errors. The dropped node is still down.
> >
> >                     Any idea what could cause this and how to fix it?
> >
> >                     Thanks!
> >
> >
> >                     This message may contain confidential and/or
> >                     privileged information.
> >                     If you are not the addressee or authorized to
> >                     receive this on behalf of the addressee you must not
> >                     use, copy, disclose or take action based on this
> >                     message or any information herein.
> >                     If you have received this message in error, please
> >                     advise the sender immediately by reply email and
> >                     delete this message. Thank you.
> >
> >
> >
> >
> >                 --
> >
> >                 Thanks,
> >
> >                 Ryan Svihla
> >
> >
> >
>
>
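
Michael's suggestion that applications retry on a host unavailable exception could be sketched roughly as follows. This is only an illustration under stated assumptions: `TransientUnavailable` is a hypothetical stand-in for whatever unavailable or read-timeout error the client driver raises, and `run_with_retry` is a made-up helper, not a driver API.

```python
import time


class TransientUnavailable(Exception):
    """Hypothetical stand-in for a driver's unavailable or timeout error."""


def run_with_retry(operation, attempts=6, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on TransientUnavailable, wait and try again.

    The wait doubles after each failed attempt (exponential backoff).
    The last failure is re-raised so the caller still sees the error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except TransientUnavailable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
```

With the defaults shown (6 attempts, 0.5 s base delay), the waits add up to 0.5 + 1 + 2 + 4 + 8 = 15.5 seconds, which is longer than the ~10 second window described in this thread. Those values are assumptions to tune, not recommendations.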

