From Denis Magda <dma...@apache.org>
Subject Re: How to free up space on disc after removing entries from IgniteCache with enabled PDS?
Date Mon, 07 Oct 2019 17:28:29 GMT
```Alex, thanks for the summary and proposal. Anton, Ivan and others who took
part in this discussion, what're your thoughts? I see this
rolling-upgrades-based approach as a reasonable solution. Even though a
node shutdown is expected, the procedure doesn't lead to the cluster outage
meaning it can be utilized for 24x7 production environments.

Denis
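
The rolling procedure discussed in this thread can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not Ignite code: `Node`, `defragment`, `rolling_defragment`, and the byte counts are hypothetical names and numbers chosen for illustration. It shows that only one node is offline at any step, so the cluster as a whole stays available.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    live_bytes: int   # bytes occupied by live entries (hypothetical metric)
    file_bytes: int   # on-disk size of the partition files (hypothetical metric)
    online: bool = True


def defragment(node: Node) -> None:
    """Compact the partition files; only allowed while the node is offline."""
    if node.online:
        raise RuntimeError(f"{node.name} must be offline to defragment")
    node.file_bytes = node.live_bytes


def rolling_defragment(cluster: List[Node]) -> List[int]:
    """Defragment one node at a time and return how many nodes
    stayed online during each step."""
    online_during_step = []
    for node in cluster:
        node.online = False
        online_during_step.append(sum(n.online for n in cluster))
        defragment(node)
        # In the proposal, historical rebalance would catch the node up here.
        node.online = True
    return online_during_step
```

With a three-node cluster, every step leaves two nodes online, and after the loop each node's file size equals its live data size.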

On Mon, Oct 7, 2019 at 1:35 AM Alexey Goncharuk <alexey.goncharuk@gmail.com>
wrote:

> Created a ticket for the first stage of this improvement. This can be a
> first change towards the online mode suggested by Sergey and Anton.
> https://issues.apache.org/jira/browse/IGNITE-12263
> On Fri, Oct 4, 2019 at 19:38, Alexey Goncharuk <alexey.goncharuk@gmail.com> wrote:
> > Maxim,
> >
> > Having a cluster-wide lock for a cache does not improve availability of
> > the solution. A user cannot defragment a cache if the cache is involved
> > in a mission-critical operation, so having a lock on such a cache is
> > equivalent to the whole cluster shutdown.
> >
> > We should decide between either a single offline node or a more complex
> > fully online solution.
> >
> > On Fri, Oct 4, 2019 at 11:55, Maxim Muzafarov <mmuzaf@apache.org> wrote:
> >
> >> Igniters,
> >>
> >> This thread seems to be endless, but what if some kind of cache group
> >> distributed write lock (exclusive for some of the internal Ignite
> >> processes) were introduced? I think it would help to solve a batch of
> >> problems, like:
> >>
> >> 1. defragmentation of all cache group partitions on the local node
> >> 2. improve data loading with data streamer isolation mode [1]. It
> >> seems we should not allow concurrent updates to cache if we on `fast
> >> 3. recovery from a snapshot without cache stop\start actions
> >>
> >>
> >> [1] https://issues.apache.org/jira/browse/IGNITE-11793
> >>
> >> On Thu, 3 Oct 2019 at 22:50, Sergey Kozlov <skozlov@gridgain.com> wrote:
> >> > Hi
> >> >
> >> > I'm not sure that taking a node offline is the best way to do that.
> >> > Cons:
> >> >  - different caches may have different degrees of fragmentation, but we
> >> >    are forced to stop the whole node
> >> >  - taking a node offline is a maintenance operation that will require
> >> >    adding +1 backup to reduce the risk of data loss
> >> >  - baseline auto adjustment?
> >> >  - impact to index rebuild?
> >> >  - cache configuration changes (or destroy) during node offline
> >> >
> >> > What about other ways without a node stop? E.g. take a cache group on a
> >> > node offline? Add a *defrag <cache_group>* command to control.sh to force
> >> > a rebalance to start internally in the node, with the expected impact on
> >> > performance.
> >> >
> >> >
> >> > On Thu, Oct 3, 2019 at 12:08 PM Anton Vinogradov <av@apache.org> wrote:
> >> >
> >> > > Alexey,
> >> > > As for me, it does not matter whether it will be an IEP, an umbrella
> >> > > ticket, or a single issue.
> >> > > The most important thing is the Assignee :)
> >> > >
> >> > > On Thu, Oct 3, 2019 at 11:59 AM Alexey Goncharuk <alexey.goncharuk@gmail.com> wrote:
> >> > >
> >> > > > Anton, do you think we should file a single ticket for this or should
> >> > > > we go with an IEP? As of now, the change does not look big enough to
> >> > > > me.
> >> > > >
> >> > > > On Thu, Oct 3, 2019 at 11:18, Anton Vinogradov <av@apache.org> wrote:
> >> > > >
> >> > > > > Alexey,
> >> > > > >
> >> > > > > Sounds good to me.
> >> > > > >
> >> > > > > On Thu, Oct 3, 2019 at 10:51 AM Alexey Goncharuk <alexey.goncharuk@gmail.com> wrote:
> >> > > > >
> >> > > > > > Anton,
> >> > > > > >
> >> > > > > > Switching a partition to and from the SHRINKING state will require
> >> > > > > > intricate synchronizations in order to properly determine the start
> >> > > > > > position for historical rebalance without PME.
> >> > > > > >
> >> > > > > > I would still go with an offline-node approach, but instead of
> >> > > > > > cleaning the persistence, we can do effective defragmentation when
> >> > > > > > the node is offline because we are sure that there is no concurrent
> >> > > > > > load. After the defragmentation completes, we bring the node back
> >> > > > > > to the cluster and historical rebalance will kick in automatically.
> >> > > > > > It will still require manual node restarts, but since the data is
> >> > > > > > not removed, there are no additional risks. Also, this will be an
> >> > > > > > excellent solution for those who can afford downtime and execute
> >> > > > > > the defragment command on all nodes in the cluster simultaneously -
> >> > > > > > this will be the fastest way possible.
> >> > > > > >
> >> > > > > > --AG
> >> > > > > >
> >> > > > > > On Mon, Sep 30, 2019 at 09:29, Anton Vinogradov wrote:
> >> > > > > > > Alexei,
> >> > > > > > > >> stopping fragmented node and removing partition data, then
> >> > > > > > > >> starting it again
> >> > > > > > > That's exactly what we're doing to solve the fragmentation issue.
> >> > > > > > > The problem here is that we have to perform N/B restart-rebalance
> >> > > > > > > operations (N - cluster size, B - backups count), and it takes a
> >> > > > > > > lot of time, with a risk of losing data.
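
As a rough illustration of the N/B estimate above, here is a minimal sketch. The function name `restart_rounds` is hypothetical, and rounding up when N is not divisible by B is an assumption, not something stated in the thread.

```python
import math


def restart_rounds(cluster_size: int, backups: int) -> int:
    """Number of restart-rebalance rounds, assuming up to `backups` nodes
    are restarted together in each round (N/B, rounded up)."""
    if cluster_size < 1 or backups < 1:
        raise ValueError("cluster_size and backups must be positive")
    return math.ceil(cluster_size / backups)
```

For a 12-node cluster with 2 backups this gives 6 rounds, each of which has to wait for a full rebalance to finish.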
> >> > > > > > >
> >> > > > > > > On Fri, Sep 27, 2019 at 5:49 PM Alexei Scherbakov <
> >> > > > > > > alexey.scherbakoff@gmail.com> wrote:
> >> > > > > > >
> >> > > > > > > > Probably this should be allowed via the public API; actually,
> >> > > > > > > > this is the same as manual rebalancing.
> >> > > > > > > >
> >> > > > > > > > On Fri, Sep 27, 2019 at 17:40, Alexei Scherbakov <
> >> > > > > > > > alexey.scherbakoff@gmail.com> wrote:
> >> > > > > > > >
> >> > > > > > > > > The poor man's solution for the problem would be stopping the
> >> > > > > > > > > fragmented node and removing partition data, then starting it
> >> > > > > > > > > again, allowing full state transfer already without deletes.
> >> > > > > > > > > Rinse and repeat for all owners.
> >> > > > > > > > >
> >> > > > > > > > > Anton Vinogradov, would this work for you as a workaround?
> >> > > > > > > > >
> >> > > > > > > > > On Thu, Sep 19, 2019 at 13:03, Anton Vinogradov <av@apache.org> wrote:
> >> > > > > > > > >> Alexey,
> >> > > > > > > > >>
> >> > > > > > > > >> Let's combine your and Ivan's proposals.
> >> > > > > > > > >>
> >> > > > > > > > >> >> vacuum command, which acquires exclusive table lock, so no
> >> > > > > > > > >> >> concurrent activities on the table are possible.
> >> > > > > > > > >> and
> >> > > > > > > > >> >> Could the problem be solved by stopping a node which needs to
> >> > > > > > > > >> >> be defragmented, clearing persistence files and restarting the
> >> > > > > > > > >> >> node?
> >> > > > > > > > >> >> After rebalancing the node back without fragmentation.
> >> > > > > > > > >>
> >> > > > > > > > >> How about having a special partition state, SHRINKING?
> >> > > > > > > > >> This state should mean that the partition is unavailable for
> >> > > > > > > > >> reads and updates but should keep its update-counters and should
> >> > > > > > > > >> not be marked as lost, renting or evicted.
> >> > > > > > > > >> In this state we are able to iterate over the partition and apply
> >> > > > > > > > >> its entries to another file in a compact way.
> >> > > > > > > > >> Indices should be updated during the copy-on-shrink procedure or
> >> > > > > > > > >> at the shrink completion.
> >> > > > > > > > >> Once the shrunk file is ready, we should replace the original
> >> > > > > > > > >> partition file with it and mark it as MOVING, which will start the
> >> > > > > > > > >> historical rebalance.
> >> > > > > > > > >> Shrinking should be performed during the low activity periods, but
> >> > > > > > > > >> even in case we found that activity was high and historical
> >> > > > > > > > >> rebalance is not suitable, we may just remove the file and use
> >> > > > > > > > >> regular rebalance to restore the partition (this will also lead to
> >> > > > > > > > >> shrink).
> >> > > > > > > > >>
> >> > > > > > > > >> BTW, it seems we are able to implement partition shrink in a cheap
> >> > > > > > > > >> way.
> >> > > > > > > > >> We may just use rebalancing code to apply the fat partition's
> >> > > > > > > > >> entries to the new file.
> >> > > > > > > > >> So, 3 stages here: local rebalance, indices update and global
> >> > > > > > > > >> historical rebalance.
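
The proposed state flow can be sketched as a small state machine. This is a hypothetical model, not Ignite's partition state code: `PartitionState`, `ALLOWED`, and `Partition` are illustrative names, SHRINKING is the state proposed in this thread and does not exist in Ignite, and OWNING is assumed here to be the normal serving state. The sketch only captures the proposal's rules: a SHRINKING partition rejects updates, keeps its update counter, and can only move on to MOVING.

```python
from enum import Enum


class PartitionState(Enum):
    OWNING = "owning"         # assumed normal serving state
    SHRINKING = "shrinking"   # proposed in the thread, hypothetical
    MOVING = "moving"         # being caught up by rebalance


# Transitions taken from the proposal: shrink, then rebalance, then serve again.
ALLOWED = {
    PartitionState.OWNING: {PartitionState.SHRINKING},
    PartitionState.SHRINKING: {PartitionState.MOVING},
    PartitionState.MOVING: {PartitionState.OWNING},
}


class Partition:
    def __init__(self, update_counter: int = 0):
        self.state = PartitionState.OWNING
        self.update_counter = update_counter

    def transition(self, target: PartitionState) -> None:
        """Move to `target`, rejecting transitions the proposal does not describe."""
        if target not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.name} -> {target.name} is not allowed")
        self.state = target  # the update counter is deliberately left untouched

    def accepts_updates(self) -> bool:
        """A partition being shrunk must not receive updates."""
        return self.state is not PartitionState.SHRINKING
```

Keeping the update counter across the SHRINKING phase is what would let historical rebalance find its starting point afterwards.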
> >> > > > > > > > >> On Thu, Sep 19, 2019 at 11:43 AM Alexey Goncharuk <
> >> > > > > > > > >> alexey.goncharuk@gmail.com> wrote:
> >> > > > > > > > >> > Anton,
> >> > > > > > > > >> >
> >> > > > > > > > >> > > >>  The solution which Anton suggested does not look easy
> >> > > > > > > > >> > > >>  because it will most likely significantly hurt performance
> >> > > > > > > > >> > > Mostly agree here, but what drop do we expect? What price are
> >> > > > > > > > >> > > we ready to pay?
> >> > > > > > > > >> > > Not sure, but seems some example, 5% drop for this.
> >> > > > > > > > >> > 5% may be a big drop for some use-cases, so I think we should
> >> > > > > > > > >> > look at how to improve performance, not how to make it worse.
> >> > > > > > > > >> >
> >> > > > > > > > >> > > >> it is hard to maintain a data structure to choose "page from
> >> > > > > > > > >> > > >> free-list with enough space closest to the beginning of the
> >> > > > > > > > >> > > >> file".
> >> > > > > > > > >> > > We can just split each free-list bucket into a couple and use
> >> > > > > > > > >> > > the first for pages in the first half of the file and the
> >> > > > > > > > >> > > second for the last.
> >> > > > > > > > >> > > Only two buckets are required here since, during the file
> >> > > > > > > > >> > > shrink, the first bucket's window will shrink too.
> >> > > > > > > > >> > > It seems this gives us the same price on put: just use the
> >> > > > > > > > >> > > first bucket in case it's not empty.
> >> > > > > > > > >> > > Remove price (with merge) will be increased, of course.
> >> > > > > > > > >> > > The compromise solution is to have priority put (to the first
> >> > > > > > > > >> > > part of the file), with keeping removal as is, and schedulable
> >> > > > > > > > >> > > per-page migration for the rest of the data during the low
> >> > > > > > > > >> > > activity period.
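
A minimal sketch of the split-bucket idea above, under simplifying assumptions: a single size class instead of Ignite's real free-list buckets, page ids that grow with file offset, and a fixed half-file boundary. `SplitFreeBucket` and its methods are hypothetical names.

```python
from typing import List, Optional


class SplitFreeBucket:
    """One free-list bucket split in two by page position in the file."""

    def __init__(self, file_pages: int):
        self.boundary = file_pages // 2   # pages below this id are in the first half
        self.low: List[int] = []          # free pages in the first half of the file
        self.high: List[int] = []         # free pages in the second half of the file

    def add(self, page_id: int) -> None:
        """Register a page that has free space."""
        target = self.low if page_id < self.boundary else self.high
        target.append(page_id)

    def take(self) -> Optional[int]:
        """Pick a page for a put: prefer the first half so the tail can drain."""
        if self.low:
            return self.low.pop()
        if self.high:
            return self.high.pop()
        return None
```

A put costs the same as before (one extra emptiness check), while new data gravitates toward the start of the file.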
> >> > > > > > > > >> > Free lists are large and slow by themselves, it is expensive to
> >> > > > > > > > >> > checkpoint and read them on start, so as a long-term solution I
> >> > > > > > > > >> > would look into removing them. Moreover, not having another
> >> > > > > > > > >> > background process will improve the codebase reliability and
> >> > > > > > > > >> > simplicity.
> >> > > > > > > > >> > If we want to go the hard path, I would look at free page
> >> > > > > > > > >> > tracking bitmap - a special bitmask page, where each page in a
> >> > > > > > > > >> > block is marked as 0 if it has free space more than a certain
> >> > > > > > > > >> > configurable threshold (say, 80%) - free, and 1 if less (full).
> >> > > > > > > > >> > Some vendors have successfully implemented this approach, which
> >> > > > > > > > >> > looks much more promising, but harder to implement.
> >> > > > > > > > >> > --AG
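
The free page tracking bitmap described above can be sketched as follows. This is an illustration under assumptions, not an existing Ignite structure: `FreePageBitmap`, the 4096-byte page size, and the linear scan in `first_free` are all hypothetical choices. Only the 0 = free / 1 = full convention and the 80% threshold come from the message.

```python
from typing import Optional


class FreePageBitmap:
    """One bit per data page: 0 = free (free space above threshold), 1 = full."""

    def __init__(self, pages: int, page_size: int = 4096, threshold: float = 0.8):
        self.pages = pages
        self.page_size = page_size
        self.threshold = threshold
        self.bits = bytearray((pages + 7) // 8)  # all zeros: every page starts free

    def update(self, page_idx: int, free_bytes: int) -> None:
        """Re-evaluate a page after its free space changed."""
        byte, mask = page_idx >> 3, 1 << (page_idx & 7)
        if free_bytes > self.threshold * self.page_size:
            self.bits[byte] &= ~mask & 0xFF   # clear the bit: page is free
        else:
            self.bits[byte] |= mask           # set the bit: page is full

    def is_full(self, page_idx: int) -> bool:
        return bool(self.bits[page_idx >> 3] & (1 << (page_idx & 7)))

    def first_free(self) -> Optional[int]:
        """Lowest-numbered free page, i.e. the one closest to the file start."""
        for idx in range(self.pages):
            if not self.is_full(idx):
                return idx
        return None
```

Because the bitmap is ordered by page index, picking the free page closest to the beginning of the file needs no extra data structure, which is the property the free-list discussion was struggling with.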
> >> > > > > > > > > --
> >> > > > > > > > >
> >> > > > > > > > > Best regards,
> >> > > > > > > > > Alexei Scherbakov
> >> > > > > > > > >
> >> > > > > > > > --
> >> > > > > > > >
> >> > > > > > > > Best regards,
> >> > > > > > > > Alexei Scherbakov
> >> > > > > > > >
> >> > --
> >> > Sergey Kozlov
> >> > GridGain Systems
> >> > www.gridgain.com
