Subject: Re: Multi DC informations (sync)
From: Ryan Svihla <rsvihla@datastax.com>
To: user@cassandra.apache.org
Date: Fri, 19 Dec 2014 13:27:10 -0600

replies inline

On Fri, Dec 19, 2014 at 10:30 AM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:
>
> All that you said matches the idea I had of how it works, except this part:
>
> "The request blocks however until all CL is satisfied" --> Does this mean
> that the client will see an error if the local DC writes the data correctly
> (i.e. CL reached) but the remote DC fails? This is not the idea I had of
> something asynchronous...
>

Asynchronous just means all the requests are sent out at once; the client
response is blocked until the CL is satisfied or a timeout occurs. If CL is
ONE, for example, the first response back will be a "success" on the client,
regardless of what's happened in the background. If it's, say, ALL, then yes,
it would wait for all responses to come back.
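To make the blocking behaviour concrete, here is a minimal client-side sketch
(not from this thread; DataStax Java driver 2.x, with a made-up contact point,
keyspace, and table):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.exceptions.UnavailableException;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class ClPerWriteSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")        // placeholder contact point
                .build();
        Session session = cluster.connect("my_ks"); // placeholder keyspace

        // The CL only controls how many acknowledgements the coordinator waits
        // for before answering the client; the write is still sent toward every
        // replica in every DC regardless.
        SimpleStatement insert = new SimpleStatement(
                "INSERT INTO events (id, payload) VALUES (uuid(), 'hello')");
        insert.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM); // or ONE / EACH_QUORUM / ALL

        try {
            session.execute(insert); // blocks until the CL is met or the request times out
        } catch (WriteTimeoutException | UnavailableException e) {
            // This is the error the client actually sees when the CL cannot be
            // satisfied in time; the write may still have landed on some replicas.
            System.err.println("CL not satisfied: " + e.getMessage());
        } finally {
            cluster.close();
        }
    }
}

So with ONE the call above returns on the first ack, and with EACH_QUORUM or
ALL it waits on the remote DCs too, exactly as described.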
> If it doesn't fail on the client side (real asynchronous), is there a way
> to make sure the remote DC has indeed received the information? I mean if
> the throughput across regions is too small, the write will fail and so will
> the HH, potentially. How do we detect that we are lacking cross-DC
> throughput, for example?
>

Monitoring, logging, etc. If an application needs EACH_QUORUM consistency
across all data centers and the performance penalty is worthwhile, then
that's probably what you're asking for. If LOCAL_QUORUM + regular repairs is
fine, then do that; if CL ONE is fine, then do that.

You SHOULD BE monitoring dropped mutations and hints via JMX or something
like OpsCenter. Outages of substantial length should probably involve a
repair; if it's over your HH timeout, it DEFINITELY should involve a repair.
If you ever have a doubt, it should involve a repair.
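For the JMX part, a rough polling sketch follows (again not from this thread).
The MBean names are my recollection of the 2.x metrics layout and the host and
port are placeholders, so double-check them against your own nodes with
jconsole first:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class DroppedMutationCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder host; 7199 is Cassandra's default JMX port.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://10.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Mutations this node dropped (assumed 2.x metric name).
            ObjectName dropped = new ObjectName(
                    "org.apache.cassandra.metrics:type=DroppedMessage,scope=MUTATION,name=Dropped");
            // Hints this node has written for unreachable replicas (assumed name).
            ObjectName hints = new ObjectName(
                    "org.apache.cassandra.metrics:type=Storage,name=TotalHints");

            long droppedCount = ((Number) mbs.getAttribute(dropped, "Count")).longValue();
            long hintCount = ((Number) mbs.getAttribute(hints, "Count")).longValue();

            // Alert if either counter keeps climbing between polls.
            System.out.println("dropped mutations=" + droppedCount
                    + ", hints written=" + hintCount);
        } finally {
            connector.close();
        }
    }
}

nodetool tpstats on each node prints the same dropped-message counters without
any code, and OpsCenter can graph them over time.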
> Repairs are indeed a good thing (we run them as a weekly routine, GC grace
> period 10 sec), but having inconsistency for a week without knowing it is
> quite an issue.
>

Then use a higher consistency level, so that the client is not surprised,
knows the state of things, and doesn't consider a write successful until
it's consistent across the data centers (I'd argue this is probably not
what you really want, but different applications have different needs). If
you need only local data center level awareness, doing LOCAL_QUORUM reads
and writes will get you where you want, but complete multi-datacenter,
nearly immediate consistency that you know about on the client is not free,
and it isn't with any system.

> Thanks for this detailed information Ryan, I hope I am clear enough while
> expressing my doubts.
>

I think it's a bit of a misunderstanding of the tools available. If you
have a need for full, nearly immediate data center consistency, my
suggestion is sizing (from a network pipe and application design SLA
perspective) for a higher CL on writes and potentially reads; the tools are
there.
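As a client-setup sketch that pairs with the LOCAL_QUORUM advice (again not
from this thread; DataStax Java driver 2.x, with made-up addresses, DC name,
and keyspace): keep coordination in the local DC and let the coordinator
forward one copy of each write to each remote DC.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class MultiDcClientSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoints("10.0.0.1", "10.0.0.2")  // placeholder local-DC nodes
                // Only coordinate on nodes in the local DC; replication to the
                // remote DC is handled by the coordinator's forwarding write.
                .withLoadBalancingPolicy(
                        new TokenAwarePolicy(new DCAwareRoundRobinPolicy("us_east")))
                // Default CL for statements that don't set one explicitly.
                .withQueryOptions(
                        new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
                .build();
        Session session = cluster.connect("my_ks");        // placeholder keyspace

        session.execute("INSERT INTO events (id, payload) VALUES (uuid(), 'hello')");
        cluster.close();
    }
}

Individual statements can still opt in to EACH_QUORUM (as in the first sketch)
for the writes that genuinely need cross-DC acknowledgement.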
> C*heers
>
> Alain
>
> 2014-12-19 15:43 GMT+01:00 Ryan Svihla <rsvihla@datastax.com>:
>>
>> More accurately, the write path of Cassandra in a multi-DC sense is
>> kinda like the following:
>>
>> 1. The write goes to a node which acts as coordinator.
>> 2. Writes go out to all replicas in that DC, and then one write per
>> remote DC goes out to another node which takes responsibility for
>> writing to all replicas in its data center. The request blocks, however,
>> until the CL is satisfied.
>> 3. If any of these writes fail, by default a hinted handoff is
>> generated.
>>
>> So as you can see, there is effectively no "lag" beyond either raw
>> network latency + node speed and/or just failed writes and waiting on
>> hint replay to occur. Likewise, repairs can be used to bring the data
>> centers back in sync, and in the case of substantial outages you will
>> need repairs to bring you back in sync. You're running repairs already,
>> right?
>>
>> Think of Cassandra as a global write, and not a message queue, and
>> you've got the basic idea.
>>
>> On Fri, Dec 19, 2014 at 7:54 AM, Alain RODRIGUEZ <arodrime@gmail.com>
>> wrote:
>>
>>> Hi Jens, thanks for your insight.
>>>
>>> "Replication lag in Cassandra terms is probably 'Hinted handoff'" -->
>>> Well, I think hinted handoffs are only used when a node is down, and
>>> are not even necessarily enabled. I guess that cross-DC async
>>> replication is something else that has nothing to do with hinted
>>> handoff, am I wrong?
>>>
>>> "`nodetool status` is your friend. It will tell you whether the
>>> cluster considers other nodes reachable or not. Run it on a node in
>>> the datacenter that you'd like to test connectivity from." -->
>>> Connectivity ≠ write success.
>>>
>>> Basically the two questions can be rephrased this way:
>>>
>>> 1 - How to monitor the async cross-DC write latency?
>>> 2 - What error should I look for when an async write fails (if any)?
>>> Or is there any other way to see that network throughput (for example)
>>> is too small for a given traffic?
>>>
>>> Hope this is clearer.
>>>
>>> C*heers,
>>>
>>> Alain
>>>
>>> 2014-12-19 11:44 GMT+01:00 Jens Rantil <jens.rantil@tink.se>:
>>>>
>>>> Alain,
>>>>
>>>> AFAIK, the DC replication is not linearizable. That is, writes are
>>>> not replicated according to a binlog or similar like MySQL. They are
>>>> replicated concurrently.
>>>>
>>>> To answer your questions:
>>>> 1 - Replication lag in Cassandra terms is probably "Hinted handoff".
>>>> You'd want to check the status of that.
>>>> 2 - `nodetool status` is your friend. It will tell you whether the
>>>> cluster considers other nodes reachable or not. Run it on a node in
>>>> the datacenter that you'd like to test connectivity from.
>>>>
>>>> Cheers,
>>>> Jens
>>>>
>>>> --
>>>> Jens Rantil, Backend engineer, Tink AB
>>>> Email: jens.rantil@tink.se  Phone: +46 708 84 18 32  Web: www.tink.se
>>>>
>>>> On Fri, Dec 19, 2014 at 11:16 AM, Alain RODRIGUEZ
>>>> <arodrime@gmail.com> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> We expanded our cluster to a multiple-DC configuration.
>>>>>
>>>>> Now I am wondering if there is any way to know:
>>>>>
>>>>> 1 - The replication lag between these 2 DCs (OpsCenter, nodetool,
>>>>> other?)
>>>>> 2 - How to make sure that sync is OK at any time
>>>>>
>>>>> I guess big companies running Cassandra are interested in this kind
>>>>> of info, so I think something exists, but I am not aware of it.
>>>>>
>>>>> Any other important information or advice you can give me about
>>>>> best practices or tricks while running a multi-DC setup (cross
>>>>> regions US <-> EU) is welcome of course!
>>>>>
>>>>> cheers,
>>>>>
>>>>> Alain
>>
>> --
>> Ryan Svihla
>> Solution Architect, DataStax

--
Ryan Svihla
Solution Architect, DataStax

DataStax is the fastest, most scalable distributed database technology,
delivering Apache Cassandra to the world's most innovative enterprises.
DataStax is built to be agile, always-on, and predictably scalable to any
size. With more than 500 customers in 45 countries, DataStax is the
database technology and transactional backbone of choice for the world's
most innovative companies such as Netflix, Adobe, Intuit, and eBay.