ignite-dev mailing list archives

From Alexei Scherbakov <alexey.scherbak...@gmail.com>
Subject Re: Asynchronous registration of binary metadata
Date Fri, 23 Aug 2019 11:02:15 GMT
Do I understand correctly that only requests touching "dirty" metadata
will be delayed, and not all of them?
Doesn't this check hurt performance? Otherwise ALL requests will be blocked
until some unrelated metadata is written, which is highly undesirable.

Otherwise it looks good, provided performance is not affected by the implementation.
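The behavior being asked about (delaying only the requests that touch not-yet-fsynced metadata) could be sketched roughly like this. This is a minimal illustration, not Ignite's actual API; `MetadataWaitRegistry` and its methods are hypothetical names:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch (hypothetical names, not Ignite code): per-type write futures so
 * that only requests using "dirty" (not-yet-fsynced) metadata wait, while
 * all unrelated requests proceed immediately.
 */
class MetadataWaitRegistry {
    /** typeId -> future completed when that type's metadata is fsynced. */
    private final Map<Integer, CompletableFuture<Void>> pending = new ConcurrentHashMap<>();

    /** Called by the writer when a disk write for typeId starts. */
    CompletableFuture<Void> onWriteStarted(int typeId) {
        CompletableFuture<Void> fut = new CompletableFuture<>();
        pending.put(typeId, fut);
        return fut;
    }

    /** Called by the writer after the fsync for typeId completes. */
    void onWriteFinished(int typeId) {
        CompletableFuture<Void> fut = pending.remove(typeId);
        if (fut != null)
            fut.complete(null);
    }

    /** Cache operations call this: blocks only if typeId is dirty. */
    void awaitIfDirty(int typeId) {
        CompletableFuture<Void> fut = pending.get(typeId);
        if (fut != null)
            fut.join(); // only requests that use this type wait here
    }
}
```

With a structure like this, a request for an unrelated type never blocks, which is what the question above is getting at.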


Thu, 22 Aug 2019 at 15:18, Denis Mekhanikov <dmekhanikov@gmail.com>:

> Alexey,
>
> Making only one node write metadata to disk synchronously is a possible
> and easy to implement solution, but it still has a few drawbacks:
>
> • Discovery will still be blocked on one node. This is better than
> blocking all nodes one by one, but disk write may take indefinite time, so
> discovery may still be affected.
> • There is an unlikely but unpleasant case:
>     1. A coordinator writes metadata synchronously to disk and finalizes
> the metadata registration. Other nodes do it asynchronously, so actual
> fsync to a disk may be delayed.
>     2. A transaction is committed.
>     3. The cluster is shut down before all nodes finish their fsync of
> metadata.
>     4. Nodes are started again one by one.
>     5. Before the previous coordinator is started again, a read operation
> tries to read data that uses the metadata that wasn’t fsynced anywhere
> except on the coordinator, which is still not started.
>     6. An error about unknown metadata is generated.
>
> In the scheme that Sergey and I proposed, this situation isn’t possible,
> since the data won’t be written to disk until the metadata fsync is
> finished. Every mapped node will wait on a future until metadata is written
> to disk before performing any cache changes.
> What do you think about such a fix?
>
> Denis
> On 22 Aug 2019, 12:44 +0300, Alexei Scherbakov <
> alexey.scherbakoff@gmail.com>, wrote:
> > Denis Mekhanikov,
> >
> > I think at least one node (the coordinator, for example) should still
> > write metadata synchronously to protect against this scenario:
> >
> > tx creating new metadata is committed <- all nodes in grid fail
> > (powered off) <- async writing to disk completes
> >
> > where <- means "happens before"
> >
> > All other nodes could write asynchronously, by using a separate thread or
> > by not doing fsync (same effect)
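A minimal sketch of this idea using plain NIO file writes (`BinaryMetadataWriter` and all names below are hypothetical illustrations, not Ignite code): the coordinator writes and fsyncs synchronously, so committed metadata survives a whole-grid power-off, while other nodes hand the same write to a single background thread:

```java
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch of the proposal above: sync + fsync on the coordinator, async elsewhere. */
class BinaryMetadataWriter {
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final boolean coordinator;
    private final Path dir;

    BinaryMetadataWriter(boolean coordinator, Path dir) {
        this.coordinator = coordinator;
        this.dir = dir;
    }

    /** Persist metadata; blocks only on the coordinator. */
    Future<?> write(int typeId, byte[] meta) {
        if (coordinator) {
            writeAndFsync(typeId, meta); // synchronous: survives power-off
            return CompletableFuture.completedFuture(null);
        }
        // other nodes offload the write to a dedicated thread
        return writer.submit(() -> writeAndFsync(typeId, meta));
    }

    private void writeAndFsync(int typeId, byte[] meta) {
        try {
            Path f = dir.resolve(typeId + ".bin");
            Files.write(f, meta);
            try (FileChannel ch = FileChannel.open(f, StandardOpenOption.WRITE)) {
                ch.force(true); // fsync the metadata file
            }
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

This preserves the "at least one durable copy" property from the scenario above while keeping the write off the discovery thread on every other node.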
> >
> >
> >
> > Wed, 21 Aug 2019 at 19:48, Denis Mekhanikov <dmekhanikov@gmail.com>:
> >
> > > Alexey,
> > >
> > > I’m not suggesting to duplicate anything.
> > > My point is that the proper fix will be implemented in a relatively
> > > distant future. Why not improve the existing mechanism now instead of
> > > waiting for the proper fix?
> > > If we don’t agree on doing this fix in master, I can do it in a fork and
> > > use it in my setup. So please let me know if you see any other drawbacks
> > > in the proposed solution.
> > >
> > > Denis
> > >
> > > > On 21 Aug 2019, at 15:53, Alexei Scherbakov <
> > > alexey.scherbakoff@gmail.com> wrote:
> > > >
> > > > Denis Mekhanikov,
> > > >
> > > > If we are still talking about a "proper" solution, the metastore (I
> > > > meant, of course, the distributed one) is the way to go.
> > > >
> > > > It has a contract to store cluster-wide metadata in the most efficient
> > > > way and can have any optimizations for concurrent writing inside.
> > > >
> > > > I'm against creating a duplicating mechanism as you suggested. We do
> > > > not need more copy/paste code.
> > > >
> > > > Another possibility is to carry metadata along with the appropriate
> > > > request if it's not found locally, but this is a rather big
> > > > modification.
> > > >
> > > >
> > > >
> > > > Tue, 20 Aug 2019 at 17:26, Denis Mekhanikov <dmekhanikov@gmail.com>:
> > > >
> > > > > Eduard,
> > > > >
> > > > > Usages will wait for the metadata to be registered and written to
> > > > > disk. No races should occur with such a flow.
> > > > > Or do you have some specific case in mind?
> > > > >
> > > > > I agree that using a distributed meta storage would be nice here.
> > > > > But this way we will kind of move back to the previous scheme with a
> > > > > replicated system cache, where metadata was stored before.
> > > > > Will the scheme with the metastorage be different in any way? Won’t
> > > > > we decide to move back to discovery messages again after a while?
> > > > >
> > > > > Denis
> > > > >
> > > > >
> > > > > > On 20 Aug 2019, at 15:13, Eduard Shangareev <eduard.shangareev@gmail.com> wrote:
> > > > > >
> > > > > > Denis,
> > > > > > How would we deal with races between registration and metadata
> > > > > > usages with such a fast fix?
> > > > > >
> > > > > > I believe that we need to move it to the distributed metastorage,
> > > > > > and await registration completeness if we can't find it (wait for
> > > > > > work in progress).
> > > > > > Discovery shouldn't wait for anything here.
> > > > > >
> > > > > > On Tue, Aug 20, 2019 at 11:55 AM Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> > > > > >
> > > > > > > Sergey,
> > > > > > >
> > > > > > > Currently metadata is written to disk sequentially on every
> > > > > > > node. Only one node at a time is able to write metadata to its
> > > > > > > storage.
> > > > > > > Slowness accumulates when you add more nodes. The delay required
> > > > > > > to write one piece of metadata may not be that big, but if you
> > > > > > > multiply it by, say, 200, then it becomes noticeable.
> > > > > > > But if we move the writing out of the discovery threads, then
> > > > > > > nodes will be doing it in parallel.
> > > > > > >
> > > > > > > I think it’s better to block some threads from a striped pool
> > > > > > > for a little while rather than blocking discovery for the same
> > > > > > > period multiplied by the number of nodes.
> > > > > > >
> > > > > > > What do you think?
> > > > > > >
> > > > > > > Denis
> > > > > > >
> > > > > > > > On 15 Aug 2019, at 10:26, Sergey Chugunov <sergey.chugunov@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Denis,
> > > > > > > >
> > > > > > > > Thanks for bringing this issue up. The decision to write
> > > > > > > > binary metadata from the discovery thread was really a tough
> > > > > > > > one to make.
> > > > > > > > I don't think that moving metadata to the metastorage is a
> > > > > > > > silver bullet here, as this approach also has its drawbacks
> > > > > > > > and is not an easy change.
> > > > > > > >
> > > > > > > > In addition to the workarounds suggested by Alexei, we have
> > > > > > > > two choices to offload the write operation from the discovery
> > > > > > > > thread:
> > > > > > > >
> > > > > > > > 1. Your scheme with a separate writer thread and futures
> > > > > > > > completed when the write operation is finished.
> > > > > > > > 2. A PME-like protocol with obvious complications like
> > > > > > > > failover and asynchronous wait for replies over the
> > > > > > > > communication layer.
> > > > > > > >
> > > > > > > > Your suggestion looks easier from a code-complexity
> > > > > > > > perspective, but in my view it increases the chances of
> > > > > > > > getting into starvation. Right now, if some node faces really
> > > > > > > > long delays during a write op, it gets kicked out of the
> > > > > > > > topology by the discovery protocol. In your case it is
> > > > > > > > possible that more and more threads from other pools get
> > > > > > > > stuck waiting on the operation future, which is also not good.
> > > > > > > >
> > > > > > > > What do you think?
> > > > > > > >
> > > > > > > > I also think that if we want to approach this issue
> > > > > > > > systematically, we need to do a deep analysis of the
> > > > > > > > metastorage option as well and finally choose which road we
> > > > > > > > want to go.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky
> > > > > > > > <arzamas123@mail.ru.invalid> wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 1. Yes, only on OS failures. In such case data will be
> > > > > > > > > > > received from alive nodes later.
> > > > > > > > > What behavior would there be in the case of one node? I
> > > > > > > > > suppose someone could obtain cache data without the
> > > > > > > > > unmarshalling schema; what would happen to grid operability
> > > > > > > > > in this case?
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 2. Yes, for walmode=FSYNC, writes to the metastore will
> > > > > > > > > > > be slow. But such a mode should not be used if you have
> > > > > > > > > > > more than two nodes in the grid, because it has a huge
> > > > > > > > > > > impact on performance.
> > > > > > > > > Does WAL mode affect the metadata store?
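For reference, the WAL mode discussed here is set via Ignite's `DataStorageConfiguration` (Ignite 2.x API). As far as I can tell, binary metadata files are written separately from the WAL, which is the subject of the rest of this thread:

```java
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;

// Ignite 2.x configuration fragment: WALMode.FSYNC forces an fsync on every
// WAL write, which is the slow mode discussed above. LOG_ONLY (the default
// since 2.4) is much faster but gives weaker durability on OS/power failure.
IgniteConfiguration cfg = new IgniteConfiguration();
DataStorageConfiguration storageCfg = new DataStorageConfiguration();
storageCfg.setWalMode(WALMode.FSYNC);
cfg.setDataStorageConfiguration(storageCfg);
```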
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Wed, 14 Aug 2019 at 14:29, Denis Mekhanikov <dmekhanikov@gmail.com>:
> > > > > > > > > > >
> > > > > > > > > > > > Folks,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for showing interest in this issue!
> > > > > > > > > > > >
> > > > > > > > > > > > Alexey,
> > > > > > > > > > > >
> > > > > > > > > > > > > I think removing fsync could help to mitigate
> > > > > > > > > > > > > performance issues with the current implementation
> > > > > > > > > > > >
> > > > > > > > > > > > Is my understanding correct that if we remove fsync,
> > > > > > > > > > > > then discovery won’t be blocked, data will be flushed
> > > > > > > > > > > > to disk in the background, and loss of information
> > > > > > > > > > > > will be possible only on OS failure? It sounds like an
> > > > > > > > > > > > acceptable workaround to me.
> > > > > > > > > > > >
> > > > > > > > > > > > Will moving metadata to the metastore actually resolve
> > > > > > > > > > > > this issue? Please correct me if I’m wrong, but we
> > > > > > > > > > > > will still need to write the information to the WAL
> > > > > > > > > > > > before releasing the discovery thread. If the WAL mode
> > > > > > > > > > > > is FSYNC, then the issue will still be there. Or is it
> > > > > > > > > > > > planned to abandon the discovery-based protocol
> > > > > > > > > > > > altogether?
> > > > > > > > > > > >
> > > > > > > > > > > > Evgeniy, Ivan,
> > > > > > > > > > > >
> > > > > > > > > > > > In my particular case the data wasn’t too big. It was
> > > > > > > > > > > > a slow virtualised disk with encryption that made
> > > > > > > > > > > > operations slow. Given that there are 200 nodes in the
> > > > > > > > > > > > cluster, where every node writes slowly, and this
> > > > > > > > > > > > process is sequential, one piece of metadata is
> > > > > > > > > > > > registered extremely slowly.
> > > > > > > > > > > >
> > > > > > > > > > > > Ivan, answering your other questions:
> > > > > > > > > > > >
> > > > > > > > > > > > > 2. Do we need persistent metadata for in-memory
> > > > > > > > > > > > > caches? Or is it so by accident?
> > > > > > > > > > > >
> > > > > > > > > > > > It should be checked whether it’s safe to stop writing
> > > > > > > > > > > > marshaller mappings to disk without losing any
> > > > > > > > > > > > guarantees.
> > > > > > > > > > > > But anyway, I would like to have a property that
> > > > > > > > > > > > controls this. If metadata registration is slow, then
> > > > > > > > > > > > initial cluster warmup may take a while. So, if we
> > > > > > > > > > > > preserve metadata on disk, we will need to warm it up
> > > > > > > > > > > > only once, and further restarts won’t be affected.
> > > > > > > > > > > >
> > > > > > > > > > > > > Do we really need a fast fix here?
> > > > > > > > > > > >
> > > > > > > > > > > > I would like a fix that can be implemented now, since
> > > > > > > > > > > > the activity of moving metadata to the metastore
> > > > > > > > > > > > doesn’t sound like a quick one. Having a temporary
> > > > > > > > > > > > solution would be nice.
> > > > > > > > > > > >
> > > > > > > > > > > > Denis
> > > > > > > > > > > >
> > > > > > > > > > > > > On 14 Aug 2019, at 11:53, Ivan Pavlukhin <vololo100@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Denis,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Several clarifying questions:
> > > > > > > > > > > > > 1. Do you have an idea why metadata registration
> > > > > > > > > > > > > takes so long? Poor disks? Too much data to write?
> > > > > > > > > > > > > Contention on disk writes with other subsystems?
> > > > > > > > > > > > > 2. Do we need persistent metadata for in-memory
> > > > > > > > > > > > > caches? Or is it so by accident?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Generally, I think that it is possible to move
> > > > > > > > > > > > > metadata-saving operations out of the discovery
> > > > > > > > > > > > > thread without losing the required
> > > > > > > > > > > > > consistency/integrity.
> > > > > > > > > > > > >
> > > > > > > > > > > > > As Alex mentioned, using the metastore looks like a
> > > > > > > > > > > > > better solution. Do we really need a fast fix here?
> > > > > > > > > > > > > (Are we talking about a fast fix?)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Wed, 14 Aug 2019 at 11:45, Zhenya Stanilovsky <arzamas123@mail.ru.invalid>:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Alexey, but in this case the customer needs to be
> > > > > > > > > > > > > > informed that a whole-cluster crash (power off,
> > > > > > > > > > > > > > even with 1 node) could lead to partial data
> > > > > > > > > > > > > > unavailability,
> > > > > > > > > > > > > > and maybe to further index corruption.
> > > > > > > > > > > > > > 1. Why does your meta take a substantial size?
> > > > > > > > > > > > > > Maybe a context leak?
> > > > > > > > > > > > > > 2. Could the meta be compressed?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Wednesday, 14 August 2019, 11:22 +03:00 from Alexei Scherbakov <alexey.scherbakoff@gmail.com>:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Denis Mekhanikov,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Currently metadata is fsync’ed on write. This
> > > > > > > > > > > > > > > might be the cause of slow-downs in case of
> > > > > > > > > > > > > > > metadata burst writes.
> > > > > > > > > > > > > > > I think removing fsync could help to mitigate
> > > > > > > > > > > > > > > performance issues with the current
> > > > > > > > > > > > > > > implementation until a proper solution is
> > > > > > > > > > > > > > > implemented: moving metadata to the metastore.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Tue, 13 Aug 2019 at 17:09, Denis Mekhanikov <dmekhanikov@gmail.com>:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I would also like to mention that marshaller
> > > > > > > > > > > > > > > > mappings are written to disk even if
> > > > > > > > > > > > > > > > persistence is disabled.
> > > > > > > > > > > > > > > > So, this issue affects purely in-memory
> > > > > > > > > > > > > > > > clusters as well.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Denis
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On 13 Aug 2019, at 17:06, Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi!
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > When persistence is enabled, binary
> > > > > > > > > > > > > > > > > metadata is written to disk upon
> > > > > > > > > > > > > > > > > registration. Currently this happens in the
> > > > > > > > > > > > > > > > > discovery thread, which makes processing of
> > > > > > > > > > > > > > > > > related messages very slow.
> > > > > > > > > > > > > > > > > There are cases when a lot of nodes and slow
> > > > > > > > > > > > > > > > > disks can make every binary type take
> > > > > > > > > > > > > > > > > several minutes to register. Plus it blocks
> > > > > > > > > > > > > > > > > processing of other messages.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I propose starting a separate thread that
> > > > > > > > > > > > > > > > > will be responsible for writing binary
> > > > > > > > > > > > > > > > > metadata to disk. So, binary type
> > > > > > > > > > > > > > > > > registration will be considered finished
> > > > > > > > > > > > > > > > > before the information about it is written
> > > > > > > > > > > > > > > > > to disk on all nodes.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The main concern here is data consistency
> > > > > > > > > > > > > > > > > in cases when a node acknowledges type
> > > > > > > > > > > > > > > > > registration and then fails before writing
> > > > > > > > > > > > > > > > > the metadata to disk.
> > > > > > > > > > > > > > > > > I see two parts of this issue:
> > > > > > > > > > > > > > > > > 1. Nodes will have different metadata after
> > > > > > > > > > > > > > > > > restarting.
> > > > > > > > > > > > > > > > > 2. If we write some data into a persisted
> > > > > > > > > > > > > > > > > cache and shut down the nodes faster than a
> > > > > > > > > > > > > > > > > new binary type is written to disk, then
> > > > > > > > > > > > > > > > > after a restart we won’t have a binary type
> > > > > > > > > > > > > > > > > to work with.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The first case is similar to a situation
> > > > > > > > > > > > > > > > > when one node fails, and after that a new
> > > > > > > > > > > > > > > > > type is registered in the cluster. This
> > > > > > > > > > > > > > > > > issue is resolved by the discovery data
> > > > > > > > > > > > > > > > > exchange. All nodes receive information
> > > > > > > > > > > > > > > > > about all binary types in the initial
> > > > > > > > > > > > > > > > > discovery messages sent by other nodes. So,
> > > > > > > > > > > > > > > > > once you restart a node, it will receive the
> > > > > > > > > > > > > > > > > information that it failed to finish writing
> > > > > > > > > > > > > > > > > to disk from other nodes.
> > > > > > > > > > > > > > > > > If all nodes shut down before finishing
> > > > > > > > > > > > > > > > > writing the metadata to disk, then after a
> > > > > > > > > > > > > > > > > restart the type will be considered
> > > > > > > > > > > > > > > > > unregistered, so another registration will
> > > > > > > > > > > > > > > > > be required.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The second case is a bit more complicated.
> > > > > > > > > > > > > > > > > But it can be resolved by making the
> > > > > > > > > > > > > > > > > discovery threads on every node create a
> > > > > > > > > > > > > > > > > future that will be completed when writing
> > > > > > > > > > > > > > > > > to disk is finished. So, every node will
> > > > > > > > > > > > > > > > > have such a future, reflecting the current
> > > > > > > > > > > > > > > > > state of persisting the metadata to disk.
> > > > > > > > > > > > > > > > > After that, if some operation needs this
> > > > > > > > > > > > > > > > > binary type, it will need to wait on that
> > > > > > > > > > > > > > > > > future until flushing to disk is finished.
> > > > > > > > > > > > > > > > > This way discovery threads won’t be blocked,
> > > > > > > > > > > > > > > > > but other threads that actually need this
> > > > > > > > > > > > > > > > > type will be.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Please let me know what you think about that.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Denis
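The future-based scheme proposed in the message above could be sketched as follows. `AsyncMetadataPersistence` and every name in it are hypothetical, not actual Ignite code; the point is only to show that the discovery thread never blocks on I/O:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Sketch (hypothetical names, not Ignite code): the discovery thread only
 * records a pending future and hands the disk write to a dedicated writer
 * thread; threads that actually need the type wait on the future.
 */
class AsyncMetadataPersistence {
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final Map<Integer, CompletableFuture<Void>> inFlight = new ConcurrentHashMap<>();

    /** Called from the discovery thread; returns immediately. */
    void onTypeRegistered(int typeId, byte[] meta) {
        CompletableFuture<Void> fut = new CompletableFuture<>();
        inFlight.put(typeId, fut);
        writer.submit(() -> {
            persistToDisk(typeId, meta); // slow I/O happens off the discovery thread
            inFlight.remove(typeId);
            fut.complete(null);
        });
    }

    /** Called by threads that actually need the type (e.g. cache operations). */
    void awaitPersisted(int typeId) {
        CompletableFuture<Void> fut = inFlight.get(typeId);
        if (fut != null)
            fut.join(); // block only the thread that needs this type
    }

    /** True once no write for typeId is pending. */
    boolean isPersisted(int typeId) {
        return !inFlight.containsKey(typeId);
    }

    /** Stand-in for the real binary metadata file write + fsync. */
    private void persistToDisk(int typeId, byte[] meta) { /* write file, fsync */ }
}
```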
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > Alexei Scherbakov
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Zhenya Stanilovsky
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > Ivan Pavlukhin
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Alexei Scherbakov
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Zhenya Stanilovsky
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > > >
> > > > --
> > > >
> > > > Best regards,
> > > > Alexei Scherbakov
> > >
> > >
> >
> > --
> >
> > Best regards,
> > Alexei Scherbakov
>


-- 

Best regards,
Alexei Scherbakov
