From: Alexei Scherbakov
Date: Fri, 23 Aug 2019 14:02:15 +0300
Subject: Re: Asynchronous registration of binary metadata
To: dev@ignite.apache.org

Do I understand correctly that only the requests affected by "dirty" metadata will be delayed, but not all of them? Doesn't this check hurt performance? Otherwise ALL requests will be blocked until some unrelated metadata is written, which is highly undesirable.

Otherwise this looks good, provided performance is not hurt by the implementation.
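As an illustration of the scheme being discussed, here is a minimal sketch of per-type write futures (hypothetical class and method names, not the actual Ignite internals): each registered binary type gets its own future that is completed once its metadata has been fsync'ed, so only operations touching a still-"dirty" type have to block, while requests that use already-flushed types proceed.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class MetadataWriteTracker {
    /** One future per binary type id, completed once that type's metadata is fsync'ed. */
    private final ConcurrentMap<Integer, CompletableFuture<Void>> writeFuts =
        new ConcurrentHashMap<>();

    /** Discovery thread: record the pending write and return without blocking. */
    CompletableFuture<Void> onTypeRegistered(int typeId) {
        return writeFuts.computeIfAbsent(typeId, id -> new CompletableFuture<>());
    }

    /** Writer thread: mark the type as safely persisted. */
    void onTypeWritten(int typeId) {
        CompletableFuture<Void> fut = writeFuts.get(typeId);

        if (fut != null)
            fut.complete(null);
    }

    /** User/striped-pool thread: blocks only if this particular type is still "dirty". */
    void awaitTypeWritten(int typeId) {
        CompletableFuture<Void> fut = writeFuts.get(typeId);

        if (fut != null)
            fut.join();
    }
}

An operation that only uses types whose metadata is already on disk finds either no future or an already completed one, so it never waits; only requests that hit a type still being flushed are delayed.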
Thu, 22 Aug 2019 at 15:18, Denis Mekhanikov:

> Alexey,
>
> Making only one node write metadata to disk synchronously is a possible and easy-to-implement solution, but it still has a few drawbacks:
>
> • Discovery will still be blocked on one node. This is better than blocking all nodes one by one, but a disk write may take an indefinite time, so discovery may still be affected.
> • There is an unlikely but at the same time unpleasant case:
> 1. The coordinator writes metadata synchronously to disk and finalizes the metadata registration. Other nodes do it asynchronously, so the actual fsync to disk may be delayed.
> 2. A transaction is committed.
> 3. The cluster is shut down before all nodes finish their fsync of the metadata.
> 4. Nodes are started again one by one.
> 5. Before the previous coordinator is started again, a read operation tries to read data that uses the metadata that wasn't fsynced anywhere except on the coordinator, which is still not started.
> 6. An error about unknown metadata is generated.
>
> In the scheme that Sergey and I proposed, this situation isn't possible, since the data won't be written to disk until the fsync is finished. Every mapped node will wait on a future until the metadata is written to disk before performing any cache changes.
> What do you think about such a fix?
>
> Denis
>
> On 22 Aug 2019, 12:44 +0300, Alexei Scherbakov <alexey.scherbakoff@gmail.com>, wrote:
> > Denis Mekhanikov,
> >
> > I think at least one node (the coordinator, for example) should still write metadata synchronously to protect from the following scenario:
> >
> > tx creating new metadata is committed <- all nodes in the grid fail (powered off) <- async writing to disk is completed
> >
> > where <- means "happens before"
> >
> > All other nodes could write asynchronously, by using a separate thread or by not doing fsync (same effect).
> >
> > Wed, 21 Aug 2019 at 19:48, Denis Mekhanikov:
> >
> > > Alexey,
> > >
> > > I'm not suggesting to duplicate anything.
> > > My point is that the proper fix will be implemented in a relatively distant future. Why not improve the existing mechanism now instead of waiting for the proper fix?
> > > If we don't agree on doing this fix in master, I can do it in a fork and use it in my setup. So please let me know if you see any other drawbacks in the proposed solution.
> > >
> > > Denis
> > >
> > > > On 21 Aug 2019, at 15:53, Alexei Scherbakov <alexey.scherbakoff@gmail.com> wrote:
> > > >
> > > > Denis Mekhanikov,
> > > >
> > > > If we are still talking about a "proper" solution, the metastore (I meant, of course, the distributed one) is the way to go.
> > > >
> > > > It has a contract to store cluster-wide metadata in the most efficient way and can have any optimization for concurrent writing inside.
> > > >
> > > > I'm against creating a duplicating mechanism as you suggested. We do not need more copy/paste code.
> > > >
> > > > Another possibility is to carry metadata along with the appropriate request if it's not found locally, but this is a rather big modification.
> > > >
> > > > Tue, 20 Aug 2019 at 17:26, Denis Mekhanikov:
> > > > >
> > > > > Eduard,
> > > > >
> > > > > Usages will wait for the metadata to be registered and written to disk. No races should occur with such a flow.
> > > > > Or do you have some specific case in mind?
> > > > >
> > > > > I agree that using a distributed metastorage would be nice here.
> > > > > But this way we will kind of move back to the previous scheme with a replicated system cache, where metadata was stored before.
> > > > > Will the scheme with the metastorage be different in any way? Won't we decide to move back to discovery messages again after a while?
> > > > >
> > > > > Denis
> > > > >
> > > > > > On 20 Aug 2019, at 15:13, Eduard Shangareev <eduard.shangareev@gmail.com> wrote:
> > > > > >
> > > > > > Denis,
> > > > > > How would we deal with races between registration and metadata usages with such a fast fix?
> > > > > >
> > > > > > I believe that we need to move it to the distributed metastorage and await registration completeness if we can't find it (wait for the work in progress). Discovery shouldn't wait for anything here.
> > > > > >
> > > > > > On Tue, Aug 20, 2019 at 11:55 AM Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> > > > > >
> > > > > > > Sergey,
> > > > > > >
> > > > > > > Currently metadata is written to disk sequentially on every node. Only one node at a time is able to write metadata to its storage.
> > > > > > > Slowness accumulates when you add more nodes. The delay required to write one piece of metadata may not be that big, but if you multiply it by, say, 200, it becomes noticeable.
> > > > > > > But if we move the writing out of the discovery threads, then nodes will be doing it in parallel.
> > > > > > >
> > > > > > > I think it's better to block some threads from a striped pool for a little while rather than blocking discovery for the same period multiplied by the number of nodes.
> > > > > > >
> > > > > > > What do you think?
> > > > > > > Denis
> > > > > > >
> > > > > > > > On 15 Aug 2019, at 10:26, Sergey Chugunov <sergey.chugunov@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Denis,
> > > > > > > >
> > > > > > > > Thanks for bringing this issue up; the decision to write binary metadata from the discovery thread was really a tough one to make.
> > > > > > > > I don't think that moving metadata to the metastorage is a silver bullet here, as that approach also has its drawbacks and is not an easy change.
> > > > > > > >
> > > > > > > > In addition to the workarounds suggested by Alexei, we have two choices for offloading the write operation from the discovery thread:
> > > > > > > >
> > > > > > > > 1. Your scheme with a separate writer thread and futures completed when the write operation is finished.
> > > > > > > > 2. A PME-like protocol, with obvious complications like failover and asynchronous waiting for replies over the communication layer.
> > > > > > > >
> > > > > > > > Your suggestion looks easier from a code-complexity perspective, but in my view it increases the chances of getting into starvation. Currently, if some node faces really long delays during the write op, it gets kicked out of the topology by the discovery protocol. In your case it is possible that more and more threads from other pools get stuck waiting on the operation future, which is also not good.
> > > > > > > >
> > > > > > > > What do you think?
> > > > > > > >
> > > > > > > > I also think that if we want to approach this issue systematically, we need to do a deep analysis of the metastorage option as well and finally choose which road we want to take.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky wrote:
> > > > > > > >
> > > > > > > > > > > 1. Yes, only on OS failures. In such a case the data will be received from alive nodes later.
> > > > > > > > > What behavior would there be in the case of a single node? I suppose someone could obtain cache data without being able to unmarshal the schema; what would happen to grid operability in this case?
> > > > > > > > >
> > > > > > > > > > > 2. Yes, for walmode=FSYNC writes to the metastore will be slow. But such a mode should not be used if you have more than two nodes in the grid, because it has a huge impact on performance.
> > > > > > > > > Does the WAL mode affect the metadata store?
> > > > > > > > >
> > > > > > > > > > Wed, 14 Aug 2019 at 14:29, Denis Mekhanikov <dmekhanikov@gmail.com>:
> > > > > > > > > > >
> > > > > > > > > > > > Folks,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for showing interest in this issue!
> > > > > > > > > > > > Alexey,
> > > > > > > > > > > >
> > > > > > > > > > > > > I think removing fsync could help to mitigate performance issues with the current implementation
> > > > > > > > > > > >
> > > > > > > > > > > > Is my understanding correct that if we remove fsync, then discovery won't be blocked, data will be flushed to disk in the background, and loss of information will be possible only on OS failure? That sounds like an acceptable workaround to me.
> > > > > > > > > > > >
> > > > > > > > > > > > Will moving metadata to the metastore actually resolve this issue? Please correct me if I'm wrong, but we will still need to write the information to the WAL before releasing the discovery thread. If the WAL mode is FSYNC, then the issue will still be there. Or is it planned to abandon the discovery-based protocol altogether?
> > > > > > > > > > > >
> > > > > > > > > > > > Evgeniy, Ivan,
> > > > > > > > > > > >
> > > > > > > > > > > > In my particular case the data wasn't too big. It was a slow virtualised disk with encryption that made operations slow. Given that there are 200 nodes in the cluster, where every node writes slowly, and this process is sequential, one piece of metadata is registered extremely slowly.
> > > > > > > > > > > >
> > > > > > > > > > > > Ivan, answering your other questions:
> > > > > > > > > > > >
> > > > > > > > > > > > > 2. Do we need persistent metadata for in-memory caches? Or is it so accidentally?
> > > > > > > > > > > >
> > > > > > > > > > > > It should be checked whether it's safe to stop writing marshaller mappings to disk without losing any guarantees.
> > > > > > > > > > > > But in any case I would like to have a property that controls this. If metadata registration is slow, then the initial cluster warmup may take a while. So, if we preserve metadata on disk, we will need to warm it up only once, and further restarts won't be affected.
> > > > > > > > > > > >
> > > > > > > > > > > > > Do we really need a fast fix here?
> > > > > > > > > > > >
> > > > > > > > > > > > I would like a fix that can be implemented now, since the activity of moving metadata to the metastore doesn't sound like a quick one. Having a temporary solution would be nice.
> > > > > > > > > > > > Denis
> > > > > > > > > > > >
> > > > > > > > > > > > > On 14 Aug 2019, at 11:53, Ivan Pavlukhin <vololo100@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Denis,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Several clarifying questions:
> > > > > > > > > > > > > 1. Do you have an idea why metadata registration takes so long? Poor disks? A lot of data to write? Contention on disk writes with other subsystems?
> > > > > > > > > > > > > 2. Do we need persistent metadata for in-memory caches? Or is it so accidentally?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Generally, I think that it is possible to move the metadata-saving operations out of the discovery thread without losing the required consistency/integrity.
> > > > > > > > > > > > >
> > > > > > > > > > > > > As Alex mentioned, using the metastore looks like a better solution. Do we really need a fast fix here? (Are we talking about a fast fix?)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Wed, 14 Aug 2019 at 11:45, Zhenya Stanilovsky <arzamas123@mail.ru.invalid>:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Alexey, but in this case the customer needs to be informed that a whole-cluster crash (power off), for example of a 1-node cluster, could lead to partial data unavailability.
> > > > > > > > > > > > > > And maybe to further index corruption.
> > > > > > > > > > > > > > 1. Why does your meta take up a substantial size? Maybe context leaking?
> > > > > > > > > > > > > > 2. Could the meta be compressed?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Wednesday, 14 August 2019, 11:22 +03:00 from Alexei Scherbakov <alexey.scherbakoff@gmail.com>:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Denis Mekhanikov,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Currently metadata is fsync'ed on write. This might be the cause of slow-downs in case of metadata burst writes.
> > > > > > > > > > > > > > > I think removing fsync could help to mitigate performance issues with the current implementation until the proper solution is implemented: moving metadata to the metastore.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Tue, 13 Aug 2019 at 17:09, Denis Mekhanikov <dmekhanikov@gmail.com>:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I would also like to mention that marshaller mappings are written to disk even if persistence is disabled.
> > > > > > > > > > > > > > > > So, this issue affects purely in-memory clusters as well.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Denis
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On 13 Aug 2019, at 17:06, Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi!
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > When persistence is enabled, binary metadata is written to disk upon registration. Currently this happens in the discovery thread, which makes processing of the related messages very slow.
> > > > > > > > > > > > > > > > > There are cases when a lot of nodes and slow disks can make every binary type take several minutes to register. Plus, it blocks processing of other messages.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I propose starting a separate thread that will be responsible for writing binary metadata to disk. So, binary type registration will be considered finished before the information about it is written to disk on all nodes.
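A minimal sketch of that separate-writer-thread idea (hypothetical class names and file layout, not the actual Ignite persistence code): the discovery thread only schedules the write and returns immediately, while a single dedicated thread performs the write and the fsync and completes the returned future.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BinaryMetadataWriter {
    /** Single dedicated thread, so writes stay ordered and never block discovery. */
    private final ExecutorService writer = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "binary-metadata-writer");
        t.setDaemon(true);
        return t;
    });

    private final Path workDir;

    BinaryMetadataWriter(Path workDir) {
        this.workDir = workDir;
    }

    /** Called from the discovery thread: schedules the write and returns immediately. */
    CompletableFuture<Void> writeAsync(int typeId, byte[] marshalledMeta) {
        return CompletableFuture.runAsync(() -> {
            Path file = workDir.resolve(typeId + ".bin");

            try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
                ch.write(ByteBuffer.wrap(marshalledMeta));
                ch.force(true); // fsync happens here, off the discovery thread.
            }
            catch (IOException e) {
                throw new RuntimeException("Failed to persist metadata of type " + typeId, e);
            }
        }, writer);
    }
}

Threads that actually need the newly registered type would then wait on the returned future, as described for the second case below.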
> > > > > > > > > > > > > > > > > The main concern here is data consistency in cases when a node acknowledges the type registration and then fails before writing the metadata to disk.
> > > > > > > > > > > > > > > > > I see two parts of this issue:
> > > > > > > > > > > > > > > > > 1. Nodes will have different metadata after restarting.
> > > > > > > > > > > > > > > > > 2. If we write some data into a persisted cache and shut the nodes down faster than the new binary type is written to disk, then after a restart we won't have a binary type to work with.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The first case is similar to a situation when one node fails and after that a new type is registered in the cluster. This issue is resolved by the discovery data exchange. All nodes receive information about all binary types in the initial discovery messages sent by other nodes. So, once you restart a node, it will receive the information that it failed to finish writing to disk from the other nodes.
> > > > > > > > > > > > > > > > > If all nodes shut down before finishing writing the metadata to disk, then after a restart the type will be considered unregistered, so another registration will be required.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The second case is a bit more complicated. But it can be resolved by making the discovery threads on every node create a future that will be completed when writing to disk is finished. So, every node will have such a future, reflecting the current state of persisting the metadata to disk.
> > > > > > > > > > > > > > > > > After that, if some operation needs this binary type, it will have to wait on that future until flushing to disk is finished.
> > > > > > > > > > > > > > > > > This way discovery threads won't be blocked, but the other threads, the ones that actually need this type, will be.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Please let me know what you think about that.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Denis

--
Best regards,
Alexei Scherbakov