From: Kyle Dunn
Date: Fri, 16 Sep 2016 18:51:19 +0000
Subject: Re: HAWQ standby master sync process
To: Ming Li, dev@hawq.incubator.apache.org
Cc: Radar Da lei

A couple of follow-on questions that originated from a production user:

1) Is there a way to ensure a standby master is "up-to-date" with WALs,
either via a SQL query or some other process-external check?

2) Can a full archive of the standby's MASTER_DATA_DIRECTORY be used to
restore another master at the DR site (or the originating one)? I realize
there are some "role to hostname" mappings in the catalog that would need
to be updated, but otherwise, [how] do the active and standby catalogs
differ?

This would be useful as an alternative to changing the WAL send/receive
code path: it allows "snapshotting" the existing standby master without
disturbing normal activity on the active master.

Thanks,
Kyle

On Mon, Sep 12, 2016 at 9:37 PM Ming Li wrote:

> Yes, as Wen said, we currently don't support two standby nodes at the
> same time, but we can change the code/design to support it once the
> design is finalized.
>
> As for having the master connect to two standby nodes directly, I don't
> think that is feasible:
> 1) The standby process reports "out of sync" if its connection to the
> master is lost, and it can't be changed back to "synced" without
> re-initializing the standby node. That may be a bug or a design
> limitation; I haven't investigated it.
> 2) Synchronous replication to a remote standby slows down master
> transaction commit processing considerably; response times would be
> greatly prolonged, which is not acceptable if the network is not good
> and fast enough.
> 3) The master node is always busy, which means other concurrent
> workloads will slow down the sync process, and the sync process will in
> turn reduce the throughput of the whole cluster.
>
> Maybe more discussion or other solutions are needed. Thanks.
>
> On Tue, Sep 13, 2016 at 9:56 AM, Wen Lin wrote:
>
>> Kyle,
>>
>> When the HAWQ cluster is initialized, if a standby master is configured
>> in hawq-site.xml, the HAWQ scripts initialize the standby master on one
>> node and register it in the master's gp_segment_configuration table, so
>> the master knows about the standby master from this catalog table.
>> Unlike a segment instance, which registers itself by sending heartbeat
>> messages to the master, the standby master has no heartbeat message.
>> It's not possible to have two standby masters running together; if you
>> initialize another standby master, the first one in
>> gp_segment_configuration will be removed.
>>
>> Regards!
>>
>> Wen
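For question 1) at the top of this message, and the gp_segment_configuration
registration Wen describes above, a minimal process-external check might look
like the sketch below. It assumes HAWQ exposes the Greenplum-derived
gp_segment_configuration table and gp_master_mirroring view with the column
names shown (role = 's' for the standby, summary_state for sync status), and
that the psql client shipped with HAWQ is on the PATH; none of these names
are confirmed by this thread, so verify them against the catalog reference
before relying on this.

import subprocess

def run_sql(sql, dbname="template1"):
    # -A -t: unaligned, tuples-only output; one row per line, '|' between fields.
    out = subprocess.check_output(
        ["psql", "-A", "-t", "-d", dbname, "-c", sql], universal_newlines=True)
    return [line.split("|") for line in out.strip().splitlines() if line]

# Which host is registered as the standby master? role = 's' is an assumption
# borrowed from the Greenplum-style catalog.
standby_hosts = run_sql(
    "SELECT hostname FROM gp_segment_configuration WHERE role = 's';")

# Is the standby considered caught up? gp_master_mirroring / summary_state are
# likewise assumed; 'hawq state' output is the fallback if they differ.
sync_state = run_sql("SELECT summary_state FROM gp_master_mirroring;")

print("registered standby host(s): %s" % standby_hosts)
print("standby sync state: %s" % sync_state)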
>> On Tue, Sep 13, 2016 at 5:32 AM, Kyle Dunn wrote:
>>
>>> Hey Ming -
>>>
>>> Am I understanding correctly that a standby master registers
>>> automagically with the active master, based on the contents of
>>> hawq-site.xml?
>>>
>>> What would happen if two different standby masters on different nodes
>>> both tried registering with the same active master? I ask because this
>>> is exactly the situation that would be useful for a passive DR site
>>> with HAWQ installed, querying for new WALs in the same flow as a local
>>> standby.
>>>
>>> As for "daisy chaining" masters, which I believe is what you described
>>> in (2) above (Master -> WAL -> Standby -> DR node), I think this may be
>>> less desirable than multiple "normal" standby client nodes, because
>>> losing the standby node becomes a cascading failure into DR.
>>>
>>> Anytime we can make use of the DFS available (I say DFS, rather than
>>> HDFS, as the hope is that eventually this could be S3, Azure blob,
>>> Ceph, etc.), we should! Unrelated to DR: in my mind this includes
>>> propagating the system catalog to segment nodes via the underlying DFS,
>>> rather than transmitting it as part of each query.
>>>
>>> Thank you for the helpful insight and discussion!
>>>
>>> -Kyle
>>>
>>> On Thu, Sep 8, 2016 at 10:55 PM Ming Li wrote:
>>>
>>>> Hi Kyle,
>>>>
>>>> As for your question about how the standby host is configured: when
>>>> the standby node (which is configured in hawq-site.xml) is started, it
>>>> automatically registers its info in the system table
>>>> gp_segment_configuration
>>>> (http://hdb.docs.pivotal.io/20/reference/catalog/gp_segment_configuration.html),
>>>> so HAWQ can use this info internally in the catalog. If you need more
>>>> details about it, @wen lin can help you.
>>>>
>>>> The standby then reports the LSN of the WAL it has synced back to the
>>>> master node. Based on this LSN, the master checks whether the gap
>>>> between master and standby is still present in its xlog files or has
>>>> been overwritten (because xlog files are recycled). If the gap is no
>>>> longer in the xlog files, we can do nothing further except report "out
>>>> of sync", which requires manually running hawq init standby to
>>>> recreate the standby node; otherwise we just push the WAL after this
>>>> LSN to the standby node, which redoes it. For any questions about the
>>>> related standby scripts, @radar can help.
>>>>
>>>> In most cases the standby has a lighter workload than the master, so I
>>>> suggest we could implement it as:
>>>> (1) The master pushes WAL to the standby node; when the standby
>>>> receives it, it first writes it to a file and then reports success to
>>>> the master, so transaction commit is not blocked.
>>>> (2) The standby node redoes the WAL locally and, at the same time,
>>>> guarantees that the WAL is transferred to the remote DR node. We can
>>>> offer different sync policies (whether to guarantee the WAL has been
>>>> transferred to the remote node before the transaction commits) to
>>>> trade off commit latency against data-loss tolerance at the remote
>>>> node.
>>>>
>>>> More to discuss:
>>>> (1) If the standby reports "out of sync" and the gap is no longer
>>>> available on the master node, we need to re-init the standby manually,
>>>> which requires shutting down the master node. We should think about a
>>>> stronger policy for this scenario, e.g. also push the WAL to other
>>>> nodes and write it as a duplicate file? Or go further and write it
>>>> directly into HDFS?
>>>> (2) If a multi-master feature is implemented, the design may need to
>>>> change. I haven't spent time on that.
>>>>
>>>> Any comments or suggestions are welcome. Thanks.
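The "write WAL into HDFS directly" idea in (1) of "More to discuss" above
could, in its crudest form, be approximated entirely outside the server:
copy completed xlog segments from the master data directory into HDFS so a
DR site can replay them even after the master recycles them. The sketch
below assumes a PostgreSQL-style pg_xlog directory under
MASTER_DATA_DIRECTORY and the hdfs dfs CLI; the paths are illustrative, and
this is not HAWQ's built-in WAL shipping.

import os
import subprocess
import time

MASTER_DATA_DIRECTORY = "/data/hawq/master"        # illustrative path
XLOG_DIR = os.path.join(MASTER_DATA_DIRECTORY, "pg_xlog")
HDFS_ARCHIVE = "hdfs://nn1:8020/hawq_wal_archive"  # illustrative URI

def archive_new_segments(already_shipped):
    # WAL segment files have 24-hex-character names; this simple filter also
    # matches the segment currently being written, so a real tool would hook
    # into an archive callback or compare against the flushed LSN instead.
    for name in sorted(os.listdir(XLOG_DIR)):
        if len(name) == 24 and name not in already_shipped:
            src = os.path.join(XLOG_DIR, name)
            subprocess.check_call(
                ["hdfs", "dfs", "-put", "-f", src, HDFS_ARCHIVE + "/" + name])
            already_shipped.add(name)

if __name__ == "__main__":
    shipped = set()
    while True:
        archive_new_segments(shipped)
        time.sleep(30)  # crude polling interval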
>>>> On Fri, Sep 9, 2016 at 1:22 AM, Kyle Dunn wrote:
>>>>
>>>>> Ming -
>>>>>
>>>>> Thank you for the info; this is very helpful for understanding how
>>>>> WAL shipment happens.
>>>>>
>>>>> One question I have is: where, if anywhere, is the destination host
>>>>> configured in walsendserver.c? Alternatively, does a standby master
>>>>> client initiate the request, rather than the active master pushing
>>>>> out WALs as they become available? I ask because a more robust DR
>>>>> solution than what I'm currently working on would allow multiple
>>>>> standby targets (i.e. one traditional standby, one DR mirror, etc.).
>>>>>
>>>>> At the moment I've opted for an approach that stops the active HAWQ
>>>>> master, creates a tarball of the entire MASTER_DATA_DIRECTORY,
>>>>> archives it on HDFS, then invokes distcp via Apache Falcon to mirror
>>>>> /hawq_default in HDFS to the DR site. After a DR event there would be
>>>>> some manual process to restore that archive and update the hostname /
>>>>> DFS references to reflect the actual DR environment.
>>>>>
>>>>> This approach is a step in the right direction, but creating the
>>>>> tarball requires a brief HAWQ master outage (currently ~1 minute when
>>>>> excluding pg_log contents and not compressing), whereas extending the
>>>>> walserver code could avoid any outage by allowing WAL replication to
>>>>> have multiple destinations.
>>>>>
>>>>> The top-level code for orchestrating this process is currently
>>>>> written in Python 2.6 compatible code - I'd like to have the DEV team
>>>>> review it, if possible, as a first step toward a future PR for
>>>>> "HAWQ DR" via Falcon.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> -Kyle
>>>>>
>>>>> On Mon, Sep 5, 2016 at 9:41 AM Ming Li wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> For the general idea, please refer to this PostgreSQL presentation:
>>>>>> https://www.pgcon.org/2008/schedule/attachments/61_Synchronous%20Log%20Shipping%20Replication.pdf
>>>>>>
>>>>>> Here is some info about the standby code.
>>>>>>
>>>>>> The standby-related code is here:
>>>>>> src/backend/postmaster/walredoserver.c
>>>>>> src/backend/postmaster/walsendserver.c
>>>>>>
>>>>>> Global picture:
>>>>>> - A backend generates WAL and passes it to the forked "WAL Sender"
>>>>>> process; the calling stack is: XLogQDMirrorWrite() =>
>>>>>> WalSendServerClientSendRequest()
>>>>>>
>>>>>> - The "WAL Sender" process is forked and loops, processing requests
>>>>>> and responses; the calling stack is:
>>>>>> walsendserver_forkexec() -> walsendserver_start() -> ServiceMain()
>>>>>> -> ServiceListenLoop() -> ServiceProcessRequest() ->
>>>>>> serviceConfig->ServiceRequest() -> WalSendServer_ServiceRequest()
>>>>>>
>>>>>> - The "WAL Sender" sends WAL to the "WAL Receiver" on the standby
>>>>>> node; the calling stack is:
>>>>>> WalSendServer_ServiceRequest() => WalSendServerDoRequest() =>
>>>>>> disconnectMirrorQD_SendClose() => write_qd_sync() => PQsendQuery()
>>>>>>
>>>>>> - On the standby side, the APIs are similar, e.g.
>>>>>> walredoserver_forkexec() vs. walsendserver_forkexec()
>>>>>>
>>>>>> Hope it helps you! ~_~
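For the snapshot-based flow Kyle describes two messages up (stop the
master, tar MASTER_DATA_DIRECTORY without pg_log, archive it on HDFS,
restart), a minimal sketch of the outage-bounded part might look like the
following. The hawq stop/start master invocations, the -a flag, the paths,
and the HDFS URI are all assumptions for illustration, not a vetted
procedure.

import os
import subprocess
import time

MASTER_DATA_DIRECTORY = "/data/hawq/master"             # assumed path
HDFS_ARCHIVE_DIR = "hdfs://nn1:8020/hawq_dr/master"     # assumed URI

def snapshot_master_data_dir():
    tarball = "/tmp/hawq_master_%d.tar.gz" % int(time.time())
    parent, name = os.path.split(MASTER_DATA_DIRECTORY.rstrip("/"))

    # Brief outage starts here: quiesce the master so the copy is consistent.
    subprocess.check_call(["hawq", "stop", "master", "-a"])
    try:
        subprocess.check_call(
            ["tar", "-czf", tarball, "--exclude=pg_log", "-C", parent, name])
    finally:
        # Bring the master back even if the tar step fails.
        subprocess.check_call(["hawq", "start", "master", "-a"])

    # Outage over; pushing the tarball to HDFS can happen with HAWQ online.
    subprocess.check_call(
        ["hdfs", "dfs", "-put", "-f", tarball, HDFS_ARCHIVE_DIR + "/"])
    return tarball

if __name__ == "__main__":
    print(snapshot_master_data_dir())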
>>>>>> On Thu, Aug 11, 2016 at 1:09 AM, Kyle Dunn wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm investigating DR options for HAWQ and was curious about the
>>>>>>> existing master catalog synchronization process. My question is
>>>>>>> mainly about what this process does at a high level and where I
>>>>>>> might look in the code base or management tools to see about
>>>>>>> extending it for additional standby masters (e.g. one in a
>>>>>>> geographically distant data center and/or a different logical HAWQ
>>>>>>> cluster). The assumption is that the HDFS blocks would be
>>>>>>> replicated by something like distcp via Falcon.
>>>>>>>
>>>>>>> I believe there are obvious things to address, like DFS / namenode
>>>>>>> URI parameters, FQDNs, and certainly failure scenarios / edge
>>>>>>> cases, but I'm mainly trying to get a dialog started to see what
>>>>>>> input, ideas, and considerations others have. One thing I'm
>>>>>>> specifically interested in is whether / how WAL can be used
>>>>>>> (@Keaton).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kyle

--
*Kyle Dunn | Data Engineering | Pivotal*
Direct: 303.905.3171 | Email: kdunn@pivotal.io
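Finally, the distcp mirroring step mentioned in Kyle's messages above
(replicating /hawq_default to the DR cluster, normally scheduled through
Apache Falcon) reduces at its core to a single distcp invocation. The
NameNode URIs below are placeholders, and the -update/-delete flags are one
reasonable choice rather than the thread's agreed approach.

import subprocess

SOURCE = "hdfs://prod-nn:8020/hawq_default"   # placeholder source URI
TARGET = "hdfs://dr-nn:8020/hawq_default"     # placeholder DR-site URI

def mirror_hawq_data():
    # -update copies only changed files; -delete removes files on the target
    # that no longer exist on the source, keeping the mirror consistent.
    subprocess.check_call(
        ["hadoop", "distcp", "-update", "-delete", SOURCE, TARGET])

if __name__ == "__main__":
    mirror_hawq_data()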