ignite-dev mailing list archives

From Denis Mekhanikov <dmekhani...@gmail.com>
Subject Re: Asynchronous registration of binary metadata
Date Tue, 20 Aug 2019 14:25:36 GMT
Eduard,

Usages will wait for the metadata to be registered and written to disk, so no races should
occur with such a flow.
Or do you have some specific case in mind?
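
For illustration, the flow could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Ignite code: the class and method names (MetadataWriteSketch, onMetadataRegistered, awaitWritten) are hypothetical, and the disk write is simulated with a sleep. It only shows the idea of a dedicated writer thread plus a per-type future that usages wait on.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch, not the Ignite implementation.
class MetadataWriteSketch {
    /** One future per binary type id; completed once the metadata is on disk. */
    private final Map<Integer, CompletableFuture<Void>> writeFuts = new ConcurrentHashMap<>();

    /** Dedicated writer thread, so the discovery thread never touches the disk. */
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    /** Called from the discovery thread: schedules the write and returns at once. */
    CompletableFuture<Void> onMetadataRegistered(int typeId, Runnable diskWrite) {
        return writeFuts.computeIfAbsent(typeId,
            id -> CompletableFuture.runAsync(diskWrite, writer));
    }

    /** Called by any operation that needs the type: blocks until the write is finished. */
    void awaitWritten(int typeId) {
        CompletableFuture<Void> fut = writeFuts.get(typeId);

        if (fut != null)
            fut.join();
    }

    void shutdown() {
        writer.shutdown();
    }

    /** Stand-in for a slow disk write. */
    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        MetadataWriteSketch sketch = new MetadataWriteSketch();

        // The "discovery thread" only schedules the write (hypothetical 100 ms delay).
        sketch.onMetadataRegistered(1, () -> sleep(100));

        // A usage of type 1 waits here until the metadata is written.
        sketch.awaitWritten(1);
        System.out.println("type 1 is on disk, usage may proceed");

        sketch.shutdown();
    }
}
```

Since a usage can only proceed after the future for its type is completed, it never observes metadata that is registered but not yet persisted locally.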

I agree that using a distributed metastorage would be nice here.
But this way we will effectively move back to the previous scheme with a replicated system cache,
where metadata was stored before.
Will the scheme with the metastorage be different in any way? Won’t we decide to move back to
discovery messages again after a while?

Denis


> On 20 Aug 2019, at 15:13, Eduard Shangareev <eduard.shangareev@gmail.com> wrote:
> 
> Denis,
> How would we deal with races between registration and metadata usages with
> such a fast fix?
> 
> I believe that we need to move it to the distributed metastorage and await
> registration completion if we can't find it (wait for work in progress).
> Discovery shouldn't wait for anything here.
> 
> On Tue, Aug 20, 2019 at 11:55 AM Denis Mekhanikov <dmekhanikov@gmail.com>
> wrote:
> 
>> Sergey,
>> 
>> Currently metadata is written to disk sequentially on every node. Only one
>> node at a time is able to write metadata to its storage.
>> Slowness accumulates when you add more nodes. The delay required to write
>> one piece of metadata may not be that big, but if you multiply it by, say,
>> 200, then it becomes noticeable.
>> But if we move the writing out of the discovery threads, then nodes will be
>> doing it in parallel.
>> 
>> I think it’s better to block some threads from a striped pool for a
>> little while rather than block discovery for the same period
>> multiplied by the number of nodes.
>> 
>> What do you think?
>> 
>> Denis
>> 
>>> On 15 Aug 2019, at 10:26, Sergey Chugunov <sergey.chugunov@gmail.com> wrote:
>>> 
>>> Denis,
>>> 
>>> Thanks for bringing this issue up. The decision to write binary metadata from
>>> the discovery thread was really a tough one to make.
>>> I don't think that moving metadata to the metastorage is a silver bullet here,
>>> as this approach also has its drawbacks and is not an easy change.
>>> 
>>> In addition to the workarounds suggested by Alexei, we have two choices to
>>> offload the write operation from the discovery thread:
>>> 
>>>  1. Your scheme with a separate writer thread and futures completed when
>>>  the write operation is finished.
>>>  2. A PME-like protocol with obvious complications like failover and
>>>  asynchronous waiting for replies over the communication layer.
>>> 
>>> Your suggestion looks easier from a code complexity perspective, but in my
>>> view it increases the chances of starvation. Now, if some node faces
>>> really long delays during a write operation, it is going to be kicked out of
>>> the topology by the discovery protocol. In your case it is possible that more
>>> and more threads from other pools may get stuck waiting on the operation
>>> future, which is also not good.
>>> 
>>> What do you think?
>>> 
>>> I also think that if we want to approach this issue systematically, we need
>>> to do a deep analysis of the metastorage option as well and finally choose
>>> which road we want to take.
>>> 
>>> Thanks!
>>> 
>>> On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky
>>> <arzamas123@mail.ru.invalid> wrote:
>>> 
>>>> 
>>>>> 
>>>>>> 1. Yes, only on OS failures. In such a case the data will be received from
>>>>>> alive nodes later.
>>>> What would the behavior be in the case of a single node? I suppose someone can
>>>> obtain cache data without an unmarshalling schema; what would happen to grid
>>>> operability in this case?
>>>> 
>>>>> 
>>>>>> 2. Yes, for walmode=FSYNC writes to the metastore will be slow. But such a
>>>>>> mode should not be used if you have more than two nodes in the grid because
>>>>>> it has a huge impact on performance.
>>>> Does the WAL mode affect the metadata store?
>>>> 
>>>>> 
>>>>>> 
>>>>>> Wed, 14 Aug 2019 at 14:29, Denis Mekhanikov <dmekhanikov@gmail.com>:
>>>>>> 
>>>>>>> Folks,
>>>>>>> 
>>>>>>> Thanks for showing interest in this issue!
>>>>>>> 
>>>>>>> Alexey,
>>>>>>> 
>>>>>>>> I think removing fsync could help to mitigate performance issues with the
>>>>>>>> current implementation
>>>>>>> 
>>>>>>> Is my understanding correct that if we remove fsync, then discovery won’t
>>>>>>> be blocked, data will be flushed to disk in the background, and loss of
>>>>>>> information will be possible only on OS failure? It sounds like an
>>>>>>> acceptable workaround to me.
>>>>>>> 
>>>>>>> Will moving metadata to the metastore actually resolve this issue? Please
>>>>>>> correct me if I’m wrong, but we will still need to write the information to
>>>>>>> the WAL before releasing the discovery thread. If the WAL mode is FSYNC, then
>>>>>>> the issue will still be there. Or is it planned to abandon the
>>>>>>> discovery-based protocol altogether?
>>>>>>> 
>>>>>>> Evgeniy, Ivan,
>>>>>>> 
>>>>>>> In my particular case the data wasn’t too big. It was a slow virtualised
>>>>>>> disk with encryption that made operations slow. Given that there are 200
>>>>>>> nodes in the cluster, every node writes slowly, and the process is
>>>>>>> sequential, one piece of metadata is registered extremely slowly.
>>>>>>> 
>>>>>>> Ivan, answering your other questions:
>>>>>>> 
>>>>>>>> 2. Do we need persistent metadata for in-memory caches? Or is it so
>>>>>>>> accidentally?
>>>>>>> 
>>>>>>> It should be checked whether it’s safe to stop writing marshaller mappings
>>>>>>> to disk without losing any guarantees.
>>>>>>> But anyway, I would like to have a property that would control this. If
>>>>>>> metadata registration is slow, then the initial cluster warmup may take a
>>>>>>> while. So, if we preserve metadata on disk, then we will need to warm it up
>>>>>>> only once, and further restarts won’t be affected.
>>>>>>> 
>>>>>>>> Do we really need a fast fix here?
>>>>>>> 
>>>>>>> I would like a fix that could be implemented now, since the activity of
>>>>>>> moving metadata to the metastore doesn’t sound like a quick one. Having a
>>>>>>> temporary solution would be nice.
>>>>>>> 
>>>>>>> Denis
>>>>>>> 
>>>>>>>> On 14 Aug 2019, at 11:53, Ivan Pavlukhin <vololo100@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Denis,
>>>>>>>> 
>>>>>>>> Several clarifying questions:
>>>>>>>> 1. Do you have an idea why metadata registration takes so long? Such
>>>>>>>> poor disks? So much data to write? Contention with disk writes by
>>>>>>>> other subsystems?
>>>>>>>> 2. Do we need persistent metadata for in-memory caches? Or is it so
>>>>>>>> accidentally?
>>>>>>>> 
>>>>>>>> Generally, I think that it is possible to move metadata saving
>>>>>>>> operations out of the discovery thread without losing the required
>>>>>>>> consistency/integrity.
>>>>>>>> 
>>>>>>>> As Alex mentioned, using the metastore looks like a better solution. Do we
>>>>>>>> really need a fast fix here? (Are we talking about a fast fix?)
>>>>>>>> 
>>>>>>>> Wed, 14 Aug 2019 at 11:45, Zhenya Stanilovsky <arzamas123@mail.ru.invalid>:
>>>>>>>>> 
>>>>>>>>> Alexey, but in this case the customer needs to be informed that a crash
>>>>>>>>> (power off) of the whole cluster (for example, a single-node one) could lead
>>>>>>>>> to partial data unavailability,
>>>>>>>>> and maybe to further index corruption.
>>>>>>>>> 1. Why does your metadata take a substantial size? Could it be context leaking?
>>>>>>>>> 2. Could the metadata be compressed?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> Wednesday, 14 August 2019, 11:22 +03:00 from Alexei Scherbakov <alexey.scherbakoff@gmail.com>:
>>>>>>>>>> 
>>>>>>>>>> Denis Mekhanikov,
>>>>>>>>>> 
>>>>>>>>>> Currently metadata is fsync'ed on write. This might be the cause of
>>>>>>>>>> slowdowns in case of metadata burst writes.
>>>>>>>>>> I think removing fsync could help to mitigate performance issues with the
>>>>>>>>>> current implementation until the proper solution is implemented:
>>>>>>>>>> moving metadata to the metastore.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Tue, 13 Aug 2019 at 17:09, Denis Mekhanikov <dmekhanikov@gmail.com>:
>>>>>>>>>> 
>>>>>>>>>>> I would also like to mention that marshaller mappings are written to disk
>>>>>>>>>>> even if persistence is disabled.
>>>>>>>>>>> So, this issue affects purely in-memory clusters as well.
>>>>>>>>>>> 
>>>>>>>>>>> Denis
>>>>>>>>>>> 
>>>>>>>>>>>> On 13 Aug 2019, at 17:06, Denis Mekhanikov <dmekhanikov@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi!
>>>>>>>>>>>> 
>>>>>>>>>>>> When persistence is enabled, binary metadata is written to disk upon
>>>>>>>>>>>> registration. Currently it happens in the discovery thread, which makes
>>>>>>>>>>>> processing of related messages very slow.
>>>>>>>>>>>> There are cases when a lot of nodes and slow disks can make every
>>>>>>>>>>>> binary type take several minutes to register. Plus, it blocks processing
>>>>>>>>>>>> of other messages.
>>>>>>>>>>>> 
>>>>>>>>>>>> I propose starting a separate thread that will be responsible for
>>>>>>>>>>>> writing binary metadata to disk. So, binary type registration will be
>>>>>>>>>>>> considered finished before information about it is written to disk on
>>>>>>>>>>>> all nodes.
>>>>>>>>>>>> 
>>>>>>>>>>>> The main concern here is data consistency in cases when a node
>>>>>>>>>>>> acknowledges type registration and then fails before writing the metadata
>>>>>>>>>>>> to disk.
>>>>>>>>>>>> I see two parts of this issue:
>>>>>>>>>>>> 1. Nodes will have different metadata after restarting.
>>>>>>>>>>>> 2. If we write some data into a persisted cache and shut down nodes faster
>>>>>>>>>>>> than a new binary type is written to disk, then after a restart we won’t
>>>>>>>>>>>> have a binary type to work with.
>>>>>>>>>>>> 
>>>>>>>>>>>> The first case is similar to a situation when one node fails, and after
>>>>>>>>>>>> that a new type is registered in the cluster. This issue is resolved by the
>>>>>>>>>>>> discovery data exchange. All nodes receive information about all binary
>>>>>>>>>>>> types in the initial discovery messages sent by other nodes. So, once you
>>>>>>>>>>>> restart a node, it will receive from other nodes the information that it
>>>>>>>>>>>> failed to finish writing to disk.
>>>>>>>>>>>> If all nodes shut down before finishing writing the metadata to disk,
>>>>>>>>>>>> then after a restart the type will be considered unregistered, so another
>>>>>>>>>>>> registration will be required.
>>>>>>>>>>>> 
>>>>>>>>>>>> The second case is a bit more complicated. But it can be resolved by
>>>>>>>>>>>> making the discovery threads on every node create a future that will be
>>>>>>>>>>>> completed when writing to disk is finished. So, every node will have such
>>>>>>>>>>>> a future, which will reflect the current state of persisting the metadata
>>>>>>>>>>>> to disk.
>>>>>>>>>>>> After that, if some operation needs this binary type, it will need to
>>>>>>>>>>>> wait on that future until flushing to disk is finished.
>>>>>>>>>>>> This way discovery threads won’t be blocked, but other threads that
>>>>>>>>>>>> actually need this type will be.
>>>>>>>>>>>> 
>>>>>>>>>>>> Please let me know what you think about that.
>>>>>>>>>>>> 
>>>>>>>>>>>> Denis
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> 
>>>>>>>>>> Best regards,
>>>>>>>>>> Alexei Scherbakov
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Zhenya Stanilovsky
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Ivan Pavlukhin
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> 
>>>>>> Best regards,
>>>>>> Alexei Scherbakov
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Zhenya Stanilovsky
>>>> 
>> 
>> 
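
The "multiply it by 200" argument quoted above can be illustrated with a small calculation. This is only a rough sketch: the 50 ms per-node write delay is a hypothetical figure chosen for the example (the thread gives no measurement), the class and method names are made up, and message propagation time between nodes is ignored.

```java
// Hypothetical back-of-the-envelope model, not a measurement.
final class RegistrationLatencySketch {
    /** Nodes write one after another in the discovery thread: delays add up. */
    static long sequentialMillis(int nodes, long perNodeWriteMillis) {
        return nodes * perNodeWriteMillis;
    }

    /** Nodes write at the same time outside discovery: one delay in total. */
    static long parallelMillis(int nodes, long perNodeWriteMillis) {
        return nodes == 0 ? 0 : perNodeWriteMillis;
    }

    public static void main(String[] args) {
        int nodes = 200;       // cluster size mentioned in the thread
        long writeMillis = 50; // assumed per-node write delay

        System.out.println("sequential: " + sequentialMillis(nodes, writeMillis) + " ms"); // 10000 ms
        System.out.println("parallel:   " + parallelMillis(nodes, writeMillis) + " ms");   // 50 ms
    }
}
```

With these assumed numbers, registering one type takes about 10 seconds when the writes are sequential and about 50 ms when every node writes in parallel.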

