ignite-dev mailing list archives

From Alexei Scherbakov <alexey.scherbak...@gmail.com>
Subject Re: Asynchronous registration of binary metadata
Date Thu, 22 Aug 2019 09:43:53 GMT
Denis Mekhanikov,

I think at least one node (the coordinator, for example) should still write
metadata synchronously, to protect against the following scenario:

tx creating new metadata is committed -> all nodes in the grid fail
(powered off) -> async write to disk completes

where -> means "happens before"; i.e. the grid goes down before the
asynchronous write ever reaches disk, so committed data references metadata
that was never persisted.

All other nodes could write asynchronously, by using a separate thread or by
not doing fsync (same effect).
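
A minimal sketch of this policy, assuming hypothetical names (MetadataStore,
onMetadataRegistered, the coordinator flag) rather than actual Ignite internals:

// Hypothetical sketch: the coordinator persists metadata synchronously (with fsync),
// all other nodes offload the write to a background thread and skip fsync.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BinaryMetadataWriter {
    interface MetadataStore {
        void write(int typeId, byte[] data, boolean fsync);
    }

    private final ExecutorService asyncWriter = Executors.newSingleThreadExecutor();
    private final boolean coordinator; // true if the local node is the discovery coordinator
    private final MetadataStore store;

    BinaryMetadataWriter(boolean coordinator, MetadataStore store) {
        this.coordinator = coordinator;
        this.store = store;
    }

    void onMetadataRegistered(int typeId, byte[] marshalled) {
        if (coordinator)
            store.write(typeId, marshalled, true);  // synchronous: survives a grid-wide power-off
        else
            asyncWriter.submit(() -> store.write(typeId, marshalled, false)); // async, no fsync
    }
}

The point is simply that at least one durable copy exists before registration is
acknowledged, so a grid-wide power-off cannot lose metadata referenced by
committed transactions.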



On Wed, 21 Aug 2019 at 19:48, Denis Mekhanikov <dmekhanikov@gmail.com> wrote:

> Alexey,
>
> I’m not suggesting duplicating anything.
> My point is that the proper fix will be implemented in the relatively
> distant future. Why not improve the existing mechanism now instead of
> waiting for the proper fix?
> If we don’t agree on doing this fix in master, I can do it in a fork and
> use it in my setup. So please let me know if you see any other drawbacks in
> the proposed solution.
>
> Denis
>
> > On 21 Aug 2019, at 15:53, Alexei Scherbakov <alexey.scherbakoff@gmail.com> wrote:
> >
> > Denis Mekhanikov,
> >
> > If we are still talking about a "proper" solution, the metastore (I meant,
> > of course, the distributed one) is the way to go.
> >
> > It has a contract to store cluster-wide metadata in the most efficient way and
> > can have any optimization for concurrent writing inside.
> >
> > I'm against creating a duplicating mechanism as you suggested. We do not
> > need more copy-pasted code.
> >
> > Another possibility is to carry metadata along with the appropriate request if
> > it's not found locally, but this is a rather big modification.
> >
> >
> >
> > On Tue, 20 Aug 2019 at 17:26, Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> >
> >> Eduard,
> >>
> >> Usages will wait for the metadata to be registered and written to disk. No
> >> races should occur with such a flow.
> >> Or do you have some specific case in mind?
> >>
> >> I agree that using a distributed metastorage would be nice here.
> >> But this way we will kind of move back to the previous scheme with a replicated
> >> system cache, where metadata was stored before.
> >> Will the scheme with the metastorage be different in any way? Won’t we decide
> >> to move back to discovery messages again after a while?
> >>
> >> Denis
> >>
> >>
> >>> On 20 Aug 2019, at 15:13, Eduard Shangareev <eduard.shangareev@gmail.com> wrote:
> >>>
> >>> Denis,
> >>> How would we deal with races between registration and metadata usages with
> >>> such a fast fix?
> >>>
> >>> I believe that we need to move it to the distributed metastorage, and await
> >>> registration completeness if we can't find it (wait for work in progress).
> >>> Discovery shouldn't wait for anything here.
> >>>
> >>> On Tue, Aug 20, 2019 at 11:55 AM Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> >>>
> >>>> Sergey,
> >>>>
> >>>> Currently metadata is written to disk sequentially on every node. Only one
> >>>> node at a time is able to write metadata to its storage.
> >>>> Slowness accumulates when you add more nodes. The delay required to write
> >>>> one piece of metadata may not be that big, but if you multiply it by, say,
> >>>> 200, then it becomes noticeable.
> >>>> But if we move the writing out of the discovery threads, then nodes will be
> >>>> doing it in parallel.
> >>>>
> >>>> I think it’s better to block some threads from a striped pool for a
> >>>> little while rather than blocking discovery for the same period, but
> >>>> multiplied by the number of nodes.
> >>>>
> >>>> What do you think?
> >>>>
> >>>> Denis
> >>>>
> >>>>> On 15 Aug 2019, at 10:26, Sergey Chugunov <sergey.chugunov@gmail.com> wrote:
> >>>>>
> >>>>> Denis,
> >>>>>
> >>>>> Thanks for bringing this issue up, the decision to write binary metadata from
> >>>>> the discovery thread was really a tough one to make.
> >>>>> I don't think that moving metadata to the metastorage is a silver bullet here,
> >>>>> as this approach also has its drawbacks and is not an easy change.
> >>>>>
> >>>>> In addition to the workarounds suggested by Alexei, we have two choices to
> >>>>> offload the write operation from the discovery thread:
> >>>>>
> >>>>> 1. Your scheme with a separate writer thread and futures completed when
> >>>>> the write operation is finished.
> >>>>> 2. A PME-like protocol with obvious complications like failover and an
> >>>>> asynchronous wait for replies over the communication layer.
> >>>>>
> >>>>> Your suggestion looks easier from a code complexity perspective, but in my
> >>>>> view it increases the chances of getting into starvation. Now, if some node
> >>>>> faces really long delays during a write op, it is going to be kicked out of the
> >>>>> topology by the discovery protocol. In your case it is possible that more and
> >>>>> more threads from other pools may get stuck waiting on the operation future,
> >>>>> which is also not good.
> >>>>>
> >>>>> What do you think?
> >>>>>
> >>>>> I also think that if we want to approach this issue systematically, we need
> >>>>> to do a deep analysis of the metastorage option as well and to finally choose
> >>>>> which road we want to go down.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky
> >>>>> <arzamas123@mail.ru.invalid> wrote:
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>> 1. Yes, only on OS failures. In such a case the data will be received from
> >>>>>>>> alive nodes later.
> >>>>>> What would the behavior be in the case of a single node? I suppose someone can
> >>>>>> obtain cache data without the unmarshalling schema; what would happen to
> >>>>>> grid operability in that case?
> >>>>>>
> >>>>>>>
> >>>>>>>> 2. Yes, for walmode=FSYNC writes to the metastore will be slow. But such a
> >>>>>>>> mode should not be used if you have more than two nodes in the grid, because
> >>>>>>>> it has a huge impact on performance.
> >>>>>> Does the WAL mode affect the metadata store?
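
For reference, the WAL mode in question is a cluster-wide data storage setting;
a typical way to set it (standard Ignite 2.x configuration API, shown only for
context) looks like this:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;

public class WalModeExample {
    public static void main(String[] args) {
        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
            // FSYNC syncs every WAL write to disk: durable on power-off, but slow.
            .setWalMode(WALMode.FSYNC);

        // Persistence must be enabled for the WAL to be used at all.
        dsCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration().setDataStorageConfiguration(dsCfg);

        Ignition.start(cfg);
    }
}

Whether the binary metadata store is affected by this setting is exactly the
question raised above.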
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, 14 Aug 2019 at 14:29, Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Folks,
> >>>>>>>>>
> >>>>>>>>> Thanks for showing interest in this issue!
> >>>>>>>>>
> >>>>>>>>> Alexey,
> >>>>>>>>>
> >>>>>>>>>> I think removing fsync could help to mitigate performance issues with
> >>>>>>>>>> the current implementation
> >>>>>>>>>
> >>>>>>>>> Is my understanding correct that if we remove fsync, then discovery won’t
> >>>>>>>>> be blocked, data will be flushed to disk in the background, and loss of
> >>>>>>>>> information will be possible only on OS failure? It sounds like an
> >>>>>>>>> acceptable workaround to me.
> >>>>>>>>>
> >>>>>>>>> Will moving metadata to the metastore actually resolve this issue? Please
> >>>>>>>>> correct me if I’m wrong, but we will still need to write the information to
> >>>>>>>>> the WAL before releasing the discovery thread. If the WAL mode is FSYNC, then
> >>>>>>>>> the issue will still be there. Or is it planned to abandon the discovery-based
> >>>>>>>>> protocol altogether?
> >>>>>>>>>
> >>>>>>>>> Evgeniy, Ivan,
> >>>>>>>>>
> >>>>>>>>> In my particular case the data wasn’t too big. It was a slow virtualised
> >>>>>>>>> disk with encryption that made operations slow. Given that there are 200
> >>>>>>>>> nodes in the cluster, every node writes slowly, and this process is
> >>>>>>>>> sequential, one piece of metadata is registered extremely slowly.
> >>>>>>>>>
> >>>>>>>>> Ivan, answering your other questions:
> >>>>>>>>>
> >>>>>>>>>> 2. Do we need persistent metadata for in-memory caches? Or is it so by
> >>>>>>>>>> accident?
> >>>>>>>>>
> >>>>>>>>> It should be checked whether it’s safe to stop writing marshaller mappings
> >>>>>>>>> to disk without losing any guarantees.
> >>>>>>>>> But anyway, I would like to have a property that would control this. If
> >>>>>>>>> metadata registration is slow, then the initial cluster warmup may take a
> >>>>>>>>> while. So, if we preserve metadata on disk, then we will need to warm it up
> >>>>>>>>> only once, and further restarts won’t be affected.
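
A sketch of what such a control property could look like; the property name
below is invented purely for illustration (IgniteSystemProperties.getBoolean()
is a real helper, the flag itself is not an existing Ignite option):

import org.apache.ignite.IgniteSystemProperties;

class MarshallerMappingPersistence {
    // Hypothetical property name used only to illustrate making on-disk
    // marshaller mappings optional for in-memory clusters.
    private static final String SKIP_PROP = "EXAMPLE_SKIP_MARSHALLER_MAPPING_PERSISTENCE";

    static boolean persistenceEnabled() {
        return !IgniteSystemProperties.getBoolean(SKIP_PROP);
    }

    static void onMappingRegistered(int typeId, String className) {
        if (persistenceEnabled())
            writeMappingToDisk(typeId, className); // warm-up cost is paid once, restarts reuse it
        // otherwise the mapping is kept in memory only
    }

    private static void writeMappingToDisk(int typeId, String className) {
        // Placeholder for the actual file write.
    }
}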
> >>>>>>>>>
> >>>>>>>>>> Do we really need a fast fix here?
> >>>>>>>>>
> >>>>>>>>> I would like a fix that could be implemented now, since the activity of
> >>>>>>>>> moving metadata to the metastore doesn’t sound like a quick one. Having a
> >>>>>>>>> temporary solution would be nice.
> >>>>>>>>>
> >>>>>>>>> Denis
> >>>>>>>>>
> >>>>>>>>>> On 14 Aug 2019, at 11:53, Павлухин Иван <vololo100@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Denis,
> >>>>>>>>>>
> >>>>>>>>>> Several clarifying questions:
> >>>>>>>>>> 1. Do you have an idea why metadata registration takes so long? Poor
> >>>>>>>>>> disks? Too much data to write? Contention with disk writes by other
> >>>>>>>>>> subsystems?
> >>>>>>>>>> 2. Do we need persistent metadata for in-memory caches? Or is it so by
> >>>>>>>>>> accident?
> >>>>>>>>>>
> >>>>>>>>>> Generally, I think that it is possible to move metadata-saving
> >>>>>>>>>> operations out of the discovery thread without losing the required
> >>>>>>>>>> consistency/integrity.
> >>>>>>>>>>
> >>>>>>>>>> As Alex mentioned, using the metastore looks like a better solution. Do we
> >>>>>>>>>> really need a fast fix here? (Are we talking about a fast fix?)
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 14 Aug 2019 at 11:45, Zhenya Stanilovsky
> >>>>>>>>>> <arzamas123@mail.ru.invalid> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Alexey, but in this case the customer needs to be informed that a
> >>>>>>>>>>> whole-cluster crash (power off), for example of a 1-node cluster, could lead
> >>>>>>>>>>> to partial data unavailability.
> >>>>>>>>>>> And maybe further index corruption.
> >>>>>>>>>>> 1. Why does your meta take up a substantial size? Maybe context is leaking?
> >>>>>>>>>>> 2. Could the meta be compressed?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> Wednesday, 14 August 2019, 11:22 +03:00 from Alexei Scherbakov
> >>>>>>>>>>>> <alexey.scherbakoff@gmail.com>:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Denis Mekhanikov,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Currently metadata is fsync'ed on write. This might be the cause of
> >>>>>>>>>>>> slow-downs in the case of metadata burst writes.
> >>>>>>>>>>>> I think removing fsync could help to mitigate performance issues with the
> >>>>>>>>>>>> current implementation until the proper solution is implemented: moving
> >>>>>>>>>>>> metadata to the metastore.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, 13 Aug 2019 at 17:09, Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I would also like to mention that marshaller mappings are written to disk
> >>>>>>>>>>>>> even if persistence is disabled.
> >>>>>>>>>>>>> So, this issue affects purely in-memory clusters as well.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Denis
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 13 Aug 2019, at 17:06, Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> When persistence is enabled, binary metadata is written to disk upon
> >>>>>>>>>>>>>> registration. Currently it happens in the discovery thread, which makes
> >>>>>>>>>>>>>> processing of related messages very slow.
> >>>>>>>>>>>>>> There are cases when a lot of nodes and slow disks can make every binary
> >>>>>>>>>>>>>> type take several minutes to register. Plus it blocks processing of other
> >>>>>>>>>>>>>> messages.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I propose starting a separate thread that will be responsible for
> >>>>>>>>>>>>>> writing binary metadata to disk. So, binary type registration will be
> >>>>>>>>>>>>>> considered finished before information about it is written to disks on
> >>>>>>>>>>>>>> all nodes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The main concern here is data consistency in cases when a node
> >>>>>>>>>>>>>> acknowledges type registration and then fails before writing the metadata
> >>>>>>>>>>>>>> to disk.
> >>>>>>>>>>>>>> I see two parts of this issue:
> >>>>>>>>>>>>>> 1. Nodes will have different metadata after restarting.
> >>>>>>>>>>>>>> 2. If we write some data into a persisted cache and shut down nodes faster
> >>>>>>>>>>>>>> than a new binary type is written to disk, then after a restart we won’t
> >>>>>>>>>>>>>> have a binary type to work with.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The first case is similar to a situation when one node fails, and after
> >>>>>>>>>>>>>> that a new type is registered in the cluster. This issue is resolved by the
> >>>>>>>>>>>>>> discovery data exchange. All nodes receive information about all binary
> >>>>>>>>>>>>>> types in the initial discovery messages sent by other nodes. So, once you
> >>>>>>>>>>>>>> restart a node, it will receive the information that it failed to finish
> >>>>>>>>>>>>>> writing to disk from the other nodes.
> >>>>>>>>>>>>>> If all nodes shut down before finishing writing the metadata to disk,
> >>>>>>>>>>>>>> then after a restart the type will be considered unregistered, so another
> >>>>>>>>>>>>>> registration will be required.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The second case is a bit more complicated. But it can be resolved by
> >>>>>>>>>>>>>> making the discovery threads on every node create a future that will be
> >>>>>>>>>>>>>> completed when writing to disk is finished. So, every node will have such a
> >>>>>>>>>>>>>> future, reflecting the current state of persisting the metadata to disk.
> >>>>>>>>>>>>>> After that, if some operation needs this binary type, it will need to
> >>>>>>>>>>>>>> wait on that future until flushing to disk is finished.
> >>>>>>>>>>>>>> This way discovery threads won’t be blocked, but other threads that
> >>>>>>>>>>>>>> actually need this type will be.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Please let me know what you think about that.
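
A compact sketch of the proposed flow, with made-up class and method names, and
java.util.concurrent futures standing in for Ignite's internal future
implementations:

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class MetadataWriteTracker {
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final Map<Integer, CompletableFuture<Void>> writeFuts = new ConcurrentHashMap<>();

    // Discovery thread: schedule the disk write and return immediately,
    // so the nodes persist the metadata in parallel instead of one by one.
    void onTypeRegistered(int typeId, byte[] metadata) {
        writeFuts.put(typeId, CompletableFuture.runAsync(() -> writeToDisk(typeId, metadata), writer));
    }

    // Any thread (e.g. from the striped pool) that needs the type waits here,
    // blocking only itself, not discovery.
    void waitUntilWritten(int typeId) {
        CompletableFuture<Void> fut = writeFuts.get(typeId);
        if (fut != null)
            fut.join();
    }

    private void writeToDisk(int typeId, byte[] metadata) {
        // Placeholder for the actual write (and fsync, if required).
    }
}

Sergey's starvation concern elsewhere in the thread applies to the join() call:
threads from other pools block there if the disk write stalls.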
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Denis
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best regards,
> >>>>>>>>>>>> Alexei Scherbakov
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Zhenya Stanilovsky
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Best regards,
> >>>>>>>>>> Ivan Pavlukhin
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>> Best regards,
> >>>>>>>> Alexei Scherbakov
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Zhenya Stanilovsky
> >>>>>>
> >>>>
> >>>>
> >>
> >>
> >
> > --
> >
> > Best regards,
> > Alexei Scherbakov
>
>

-- 

Best regards,
Alexei Scherbakov
