ignite-dev mailing list archives

From Denis Mekhanikov <dmekhani...@gmail.com>
Subject Re: Asynchronous registration of binary metadata
Date Thu, 22 Aug 2019 12:17:41 GMT
Alexey,

Making only one node write metadata to disk synchronously is a possible and easy-to-implement
solution, but it still has a few drawbacks:

• Discovery will still be blocked on one node. This is better than blocking all nodes one
by one, but a disk write may take an indefinite amount of time, so discovery may still be affected.
• There is an unlikely but nevertheless unpleasant case:
    1. The coordinator writes metadata to disk synchronously and finalizes the metadata registration.
Other nodes do it asynchronously, so the actual fsync to disk may be delayed.
    2. A transaction is committed.
    3. The cluster is shut down before all nodes finish their fsync of the metadata.
    4. Nodes are started again one by one.
    5. Before the previous coordinator is started again, a read operation tries to read
data that uses the metadata that wasn't fsynced anywhere except on the coordinator, which
is still not started.
    6. An error about unknown metadata is generated.

In the scheme that Sergey and I proposed, this situation isn't possible, since cache data
won't be written until the metadata fsync is finished. Every mapped node will wait on a future
until the metadata is written to disk before performing any cache changes.
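
To illustrate the idea, here is a rough sketch in Java of how this could look. The class and
method names (BinaryMetadataWriteCoordinator, awaitMetadataWritten, etc.) are made up for
illustration only, this is not the actual Ignite code: the discovery thread just schedules the
write, and only the threads that actually need the new binary type block on a per-type future
until the metadata is fsync'ed.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Rough sketch of the proposed scheme (hypothetical names, not actual Ignite internals):
 * the discovery thread only schedules the metadata write, while threads that need the
 * new binary type wait on a per-type future until the metadata is durably written.
 */
class BinaryMetadataWriteCoordinator {
    /** Futures completed once metadata for a type id is fsync'ed to disk. */
    private final Map<Integer, CompletableFuture<Void>> writeFuts = new ConcurrentHashMap<>();

    /** Dedicated writer thread keeps disk writes off the discovery thread. */
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    /** Called from the discovery thread: non-blocking, just schedules the write. */
    void onMetadataRegistered(int typeId, byte[] marshalledMeta) {
        CompletableFuture<Void> fut = writeFuts.computeIfAbsent(typeId, id -> new CompletableFuture<>());

        writer.execute(() -> {
            try {
                writeAndFsync(typeId, marshalledMeta); // blocking disk write + fsync

                fut.complete(null);
            }
            catch (Exception e) {
                fut.completeExceptionally(e);
            }
        });
    }

    /** Called by cache update threads before applying changes that use the type. */
    void awaitMetadataWritten(int typeId) {
        CompletableFuture<Void> fut = writeFuts.get(typeId);

        if (fut != null)
            fut.join(); // blocks a striped-pool thread, not discovery
    }

    private void writeAndFsync(int typeId, byte[] meta) throws Exception {
        // A real implementation would write the marshalled metadata to the store and fsync the file.
    }
}

This way the discovery thread returns right after onMetadataRegistered(), and only the
operations that actually depend on the new type wait in awaitMetadataWritten().
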
What do you think about such a fix?

Denis
On 22 Aug 2019, 12:44 +0300, Alexei Scherbakov <alexey.scherbakoff@gmail.com>, wrote:
> Denis Mekhanikov,
>
> I think at least one node (the coordinator, for example) should still write
> metadata synchronously to protect from a scenario:
>
> tx creating new metadata is committed <- all nodes in grid are failed
> (powered off) <- async writing to disk is completed
>
> where <- means "happens before"
>
> All other nodes could write asynchronously, by using a separate thread or not
> doing fsync (same effect)
>
>
>
> Wed, 21 Aug 2019 at 19:48, Denis Mekhanikov <dmekhanikov@gmail.com>:
>
> > Alexey,
> >
> > I’m not suggesting to duplicate anything.
> > My point is that the proper fix will be implemented in a relatively
> > distant future. Why not improve the existing mechanism now instead of
> > waiting for the proper fix?
> > If we don’t agree on doing this fix in master, I can do it in a fork and
> > use it in my setup. So please let me know if you see any other drawbacks in
> > the proposed solution.
> >
> > Denis
> >
> > > On 21 Aug 2019, at 15:53, Alexei Scherbakov <alexey.scherbakoff@gmail.com> wrote:
> > >
> > > Denis Mekhanikov,
> > >
> > > If we are still talking about the "proper" solution, the metastore (I meant,
> > > of course, the distributed one) is the way to go.
> > >
> > > It has a contract to store cluster-wide metadata in the most efficient way
> > > and can have any optimization for concurrent writing inside.
> > >
> > > I'm against creating some duplicating mechanism as you suggested. We do not
> > > need another copy/paste code.
> > >
> > > Another possibility is to carry the metadata along with the appropriate
> > > request if it's not found locally, but this is a rather big modification.
> > >
> > >
> > >
> > > Tue, 20 Aug 2019 at 17:26, Denis Mekhanikov <dmekhanikov@gmail.com>:
> > >
> > > > Eduard,
> > > >
> > > > Usages will wait for the metadata to be registered and written to disk.
> > > > No races should occur with such a flow.
> > > > Or do you have some specific case in mind?
> > > >
> > > > I agree that using a distributed meta storage would be nice here.
> > > > But this way we will kind of move back to the previous scheme with a
> > > > replicated system cache, where metadata was stored before.
> > > > Will the scheme with the metastorage be different in any way? Won't we
> > > > decide to move back to discovery messages again after a while?
> > > >
> > > > Denis
> > > >
> > > >
> > > > > On 20 Aug 2019, at 15:13, Eduard Shangareev <eduard.shangareev@gmail.com> wrote:
> > > > >
> > > > > Denis,
> > > > > How would we deal with races between registration and metadata usages
> > > > > with such a fast fix?
> > > > >
> > > > > I believe that we need to move it to the distributed metastorage, and
> > > > > await registration completeness if we can't find it (wait for work in
> > > > > progress).
> > > > > Discovery shouldn't wait for anything here.
> > > > >
> > > > > On Tue, Aug 20, 2019 at 11:55 AM Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> > > > >
> > > > > > Sergey,
> > > > > >
> > > > > > Currently metadata is written to disk sequentially on every node. Only
> > > > > > one node at a time is able to write metadata to its storage.
> > > > > > Slowness accumulates when you add more nodes. A delay required to write
> > > > > > one piece of metadata may not be that big, but if you multiply it by,
> > > > > > say, 200, then it becomes noticeable.
> > > > > > But if we move the writing out of discovery threads, then nodes will be
> > > > > > doing it in parallel.
> > > > > >
> > > > > > I think it's better to block some threads from a striped pool for a
> > > > > > little while rather than blocking discovery for the same period, but
> > > > > > multiplied by the number of nodes.
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > Denis
> > > > > >
> > > > > > > On 15 Aug 2019, at 10:26, Sergey Chugunov <sergey.chugunov@gmail.com> wrote:
> > > > > > >
> > > > > > > Denis,
> > > > > > >
> > > > > > > Thanks for bringing this issue up, the decision to write binary metadata
> > > > > > > from the discovery thread was really a tough one to make.
> > > > > > > I don't think that moving metadata to the metastorage is a silver bullet
> > > > > > > here, as this approach also has its drawbacks and is not an easy change.
> > > > > > >
> > > > > > > In addition to the workarounds suggested by Alexei we have two choices to
> > > > > > > offload the write operation from the discovery thread:
> > > > > > >
> > > > > > > 1. Your scheme with a separate writer thread and futures completed when
> > > > > > > the write operation is finished.
> > > > > > > 2. A PME-like protocol with obvious complications like failover and
> > > > > > > asynchronous wait for replies over the communication layer.
> > > > > > >
> > > > > > > Your suggestion looks easier from a code complexity perspective, but in my
> > > > > > > view it increases the chances to get into starvation. Now, if some node
> > > > > > > faces really long delays during a write op, it is gonna be kicked out of
> > > > > > > the topology by the discovery protocol. In your case it is possible that
> > > > > > > more and more threads from other pools may get stuck waiting on the
> > > > > > > operation future, which is also not good.
> > > > > > >
> > > > > > > What do you think?
> > > > > > >
> > > > > > > I also think that if we want to approach this issue systematically, we
> > > > > > > need to do a deep analysis of the metastorage option as well and to
> > > > > > > finally choose which road we wanna go.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky
> > > > > > > <arzamas123@mail.ru.invalid> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > 1. Yes, only on OS failures. In such a case data will be received
> > > > > > > > > > from alive nodes later.
> > > > > > > > What would the behavior be in the case of one node? I suppose someone can
> > > > > > > > obtain cache data without unmarshalling the schema; what would happen to
> > > > > > > > grid operability in this case?
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > 2. Yes, for walmode=FSYNC writes to the metastore will be slow. But
> > > > > > > > > > such a mode should not be used if you have more than two nodes in the
> > > > > > > > > > grid because it has a huge impact on performance.
> > > > > > > > Does WAL mode affect the metadata store?
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Wed, 14 Aug 2019 at 14:29, Denis Mekhanikov <dmekhanikov@gmail.com>:
> > > > > > > > > >
> > > > > > > > > > > Folks,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for showing interest in this issue!
> > > > > > > > > > >
> > > > > > > > > > > Alexey,
> > > > > > > > > > >
> > > > > > > > > > > > I think removing fsync could help to mitigate performance issues
> > > > > > > > > > > > with current implementation
> > > > > > > > > > >
> > > > > > > > > > > Is my understanding correct, that if we remove fsync, then discovery
> > > > > > > > > > > won't be blocked, data will be flushed to disk in the background,
> > > > > > > > > > > and loss of information will be possible only on OS failure? It
> > > > > > > > > > > sounds like an acceptable workaround to me.
> > > > > > > > > > >
> > > > > > > > > > > Will moving metadata to the metastore actually resolve this issue?
> > > > > > > > > > > Please correct me if I'm wrong, but we will still need to write the
> > > > > > > > > > > information to the WAL before releasing the discovery thread. If the
> > > > > > > > > > > WAL mode is FSYNC, then the issue will still be there. Or is it
> > > > > > > > > > > planned to abandon the discovery-based protocol at all?
> > > > > > > > > > >
> > > > > > > > > > > Evgeniy, Ivan,
> > > > > > > > > > >
> > > > > > > > > > > In my particular case the data wasn't too big. It was a slow
> > > > > > > > > > > virtualised disk with encryption that made operations slow. Given
> > > > > > > > > > > that there are 200 nodes in the cluster, where every node writes
> > > > > > > > > > > slowly, and this process is sequential, one piece of metadata is
> > > > > > > > > > > registered extremely slowly.
> > > > > > > > > > >
> > > > > > > > > > > Ivan, answering your other questions:
> > > > > > > > > > >
> > > > > > > > > > > > 2. Do we need a persistent metadata for in-memory caches? Or is it
> > > > > > > > > > > > so accidentally?
> > > > > > > > > > >
> > > > > > > > > > > It should be checked whether it's safe to stop writing marshaller
> > > > > > > > > > > mappings to disk without losing any guarantees.
> > > > > > > > > > > But anyway, I would like to have a property that would control this.
> > > > > > > > > > > If metadata registration is slow, then the initial cluster warmup
> > > > > > > > > > > may take a while. So, if we preserve metadata on disk, then we will
> > > > > > > > > > > need to warm it up only once, and further restarts won't be affected.
> > > > > > > > > > >
> > > > > > > > > > > > Do we really need a fast fix here?
> > > > > > > > > > >
> > > > > > > > > > > I would like a fix that could be implemented now, since the activity
> > > > > > > > > > > of moving metadata to the metastore doesn't sound like a quick one.
> > > > > > > > > > > Having a temporary solution would be nice.
> > > > > > > > > > >
> > > > > > > > > > > Denis
> > > > > > > > > > >
> > > > > > > > > > > > On 14 Aug 2019, at 11:53, Павлухин Иван <vololo100@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Denis,
> > > > > > > > > > > >
> > > > > > > > > > > > Several clarifying questions:
> > > > > > > > > > > > 1. Do you have an idea why metadata registration takes so long?
> > > > > > > > > > > > Poor disks? Too much data to write? A contention with disk writes
> > > > > > > > > > > > by other subsystems?
> > > > > > > > > > > > 2. Do we need a persistent metadata for in-memory caches? Or is it
> > > > > > > > > > > > so accidentally?
> > > > > > > > > > > >
> > > > > > > > > > > > Generally, I think that it is possible to move metadata saving
> > > > > > > > > > > > operations out of the discovery thread without losing the required
> > > > > > > > > > > > consistency/integrity.
> > > > > > > > > > > >
> > > > > > > > > > > > As Alex mentioned, using the metastore looks like a better
> > > > > > > > > > > > solution. Do we really need a fast fix here? (Are we talking about
> > > > > > > > > > > > a fast fix?)
> > > > > > > > > > > >
> > > > > > > > > > > > Wed, 14 Aug 2019 at 11:45, Zhenya Stanilovsky <arzamas123@mail.ru.invalid>:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Alexey, but in this case the customer needs to be informed that
> > > > > > > > > > > > > a whole-cluster crash (power off), for example with 1 node, could
> > > > > > > > > > > > > lead to partial data unavailability.
> > > > > > > > > > > > > And maybe further index corruption.
> > > > > > > > > > > > > 1. Why does your meta take up a substantial size? Maybe context
> > > > > > > > > > > > > is leaking?
> > > > > > > > > > > > > 2. Could meta be compressed?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Wednesday, 14 August 2019, 11:22 +03:00 from Alexei Scherbakov <alexey.scherbakoff@gmail.com>:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Denis Mekhanikov,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Currently metadata are fsync'ed on write. This might be the
> > > > > > > > > > > > > > cause of slow-downs in case of metadata burst writes.
> > > > > > > > > > > > > > I think removing fsync could help to mitigate performance
> > > > > > > > > > > > > > issues with current implementation until the proper solution
> > > > > > > > > > > > > > is implemented: moving metadata to the metastore.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Tue, 13 Aug 2019 at 17:09, Denis Mekhanikov <dmekhanikov@gmail.com>:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I would also like to mention that marshaller mappings are
> > > > > > > > > > > > > > > written to disk even if persistence is disabled.
> > > > > > > > > > > > > > > So, this issue affects purely in-memory clusters as well.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Denis
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On 13 Aug 2019, at 17:06, Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi!
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > When persistence is enabled, binary metadata is written to
> > > > > > > > > > > > > > > > disk upon registration. Currently it happens in the
> > > > > > > > > > > > > > > > discovery thread, which makes processing of related
> > > > > > > > > > > > > > > > messages very slow.
> > > > > > > > > > > > > > > > There are cases when a lot of nodes and slow disks can make
> > > > > > > > > > > > > > > > every binary type take several minutes to register. Plus it
> > > > > > > > > > > > > > > > blocks processing of other messages.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I propose starting a separate thread that will be
> > > > > > > > > > > > > > > > responsible for writing binary metadata to disk. So, binary
> > > > > > > > > > > > > > > > type registration will be considered finished before
> > > > > > > > > > > > > > > > information about it is written to disks on all nodes.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The main concern here is data consistency in cases when a
> > > > > > > > > > > > > > > > node acknowledges type registration and then fails before
> > > > > > > > > > > > > > > > writing the metadata to disk.
> > > > > > > > > > > > > > > > I see two parts of this issue:
> > > > > > > > > > > > > > > > 1. Nodes will have different metadata after restarting.
> > > > > > > > > > > > > > > > 2. If we write some data into a persisted cache and shut
> > > > > > > > > > > > > > > > down nodes faster than a new binary type is written to
> > > > > > > > > > > > > > > > disk, then after a restart we won't have a binary type to
> > > > > > > > > > > > > > > > work with.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The first case is similar to a situation when one node
> > > > > > > > > > > > > > > > fails, and after that a new type is registered in the
> > > > > > > > > > > > > > > > cluster. This issue is resolved by the discovery data
> > > > > > > > > > > > > > > > exchange. All nodes receive information about all binary
> > > > > > > > > > > > > > > > types in the initial discovery messages sent by other
> > > > > > > > > > > > > > > > nodes. So, once you restart a node, it will receive the
> > > > > > > > > > > > > > > > information that it failed to finish writing to disk from
> > > > > > > > > > > > > > > > other nodes.
> > > > > > > > > > > > > > > > If all nodes shut down before finishing writing the
> > > > > > > > > > > > > > > > metadata to disk, then after a restart the type will be
> > > > > > > > > > > > > > > > considered unregistered, so another registration will be
> > > > > > > > > > > > > > > > required.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The second case is a bit more complicated. But it can be
> > > > > > > > > > > > > > > > resolved by making the discovery threads on every node
> > > > > > > > > > > > > > > > create a future that will be completed when writing to disk
> > > > > > > > > > > > > > > > is finished. So, every node will have such a future that
> > > > > > > > > > > > > > > > will reflect the current state of persisting the metadata
> > > > > > > > > > > > > > > > to disk.
> > > > > > > > > > > > > > > > After that, if some operation needs this binary type, it
> > > > > > > > > > > > > > > > will need to wait on that future until flushing to disk is
> > > > > > > > > > > > > > > > finished.
> > > > > > > > > > > > > > > > This way discovery threads won't be blocked, but other
> > > > > > > > > > > > > > > > threads that actually need this type will be.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Please let
me know what you think about that.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Denis
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > Alexei Scherbakov
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Zhenya Stanilovsky
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Ivan Pavlukhin
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Alexei Scherbakov
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Zhenya Stanilovsky
> > > > > > > >
> > > > > >
> > > > > >
> > > >
> > > >
> > >
> > > --
> > >
> > > Best regards,
> > > Alexei Scherbakov
> >
> >
>
> --
>
> Best regards,
> Alexei Scherbakov
