ignite-dev mailing list archives

From Sergey Kozlov <skoz...@gridgain.com>
Subject Re: How to free up space on disc after removing entries from IgniteCache with enabled PDS?
Date Thu, 10 Oct 2019 18:59:29 GMT
Alexey

I'm OK with the approach suggested in [1].

1. https://issues.apache.org/jira/browse/IGNITE-12263

On Tue, Oct 8, 2019 at 9:59 PM Denis Magda <dmagda@apache.org> wrote:

> Anton,
>
> Seems like we have a name for the defragmentation mode with a downtime -
> Rolling Defrag )
>
> -
> Denis
>
>
> On Mon, Oct 7, 2019 at 11:04 PM Anton Vinogradov <av@apache.org> wrote:
>
> > Denis,
> >
> > I like the idea that defragmentation is just an additional step on a node
> > (re)start, like we perform PDS recovery now.
> > We may just use a special key to specify that a node should defragment its
> > persistence on (re)start.
> > Defragmentation can be part of Rolling Upgrade in this case :)
> > It does not seem to be a problem to restart nodes one-by-one; this will
> > "eat" only one backup guarantee.
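A minimal Java sketch of the "special key on (re)start" idea above; the property name and the defragmentation hook are hypothetical illustrations, not existing Ignite APIs:

    import org.apache.ignite.IgniteSystemProperties;

    /** Sketch: optionally defragment local persistence before a restarted node rejoins. */
    public class DefragOnStartHook {
        /** Hypothetical flag name; not an existing Ignite system property. */
        private static final String DEFRAG_ON_START = "IGNITE_DEFRAGMENTATION_ON_START";

        /** Hypothetical callback that rewrites the local partition files compactly. */
        public interface LocalDefragmenter {
            void defragmentLocalPartitions();
        }

        /** Would run during node startup, after PDS recovery and before cluster join. */
        public void beforeJoin(LocalDefragmenter defragmenter) {
            // Defragment only when the operator explicitly requested it for this restart.
            if (IgniteSystemProperties.getBoolean(DEFRAG_ON_START, false))
                defragmenter.defragmentLocalPartitions();
        }
    }

Restarting nodes one at a time with such a flag keeps all other copies of each partition online, which is why the procedure only "eats" one backup guarantee.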
> >
> > On Mon, Oct 7, 2019 at 8:28 PM Denis Magda <dmagda@apache.org> wrote:
> >
> > > Alex, thanks for the summary and proposal. Anton, Ivan and others who took
> > > part in this discussion, what're your thoughts? I see this
> > > rolling-upgrades-based approach as a reasonable solution. Even though a
> > > node shutdown is expected, the procedure doesn't lead to a cluster outage,
> > > meaning it can be utilized for 24x7 production environments.
> > >
> > > -
> > > Denis
> > >
> > >
> > > On Mon, Oct 7, 2019 at 1:35 AM Alexey Goncharuk <
> > > alexey.goncharuk@gmail.com>
> > > wrote:
> > >
> > > > Created a ticket for the first stage of this improvement. This can be a
> > > > first change towards the online mode suggested by Sergey and Anton.
> > > > https://issues.apache.org/jira/browse/IGNITE-12263
> > > >
> > > > Fri, Oct 4, 2019, 19:38, Alexey Goncharuk <alexey.goncharuk@gmail.com>:
> > > >
> > > > > Maxim,
> > > > >
> > > > > Having a cluster-wide lock for a cache does not improve availability of
> > > > > the solution. A user cannot defragment a cache if the cache is involved
> > > > > in a mission-critical operation, so having a lock on such a cache is
> > > > > equivalent to the whole cluster shutdown.
> > > > >
> > > > > We should decide between either a single offline node or a more complex,
> > > > > fully online solution.
> > > > >
> > > > > Fri, Oct 4, 2019, 11:55, Maxim Muzafarov <mmuzaf@apache.org>:
> > > > >
> > > > >> Igniters,
> > > > >>
> > > > >> This thread seems to be endless, but what if some kind of cache group
> > > > >> distributed write lock (exclusive for some of the internal Ignite
> > > > >> processes) is introduced? I think it will help to solve a batch of
> > > > >> problems, like:
> > > > >>
> > > > >> 1. defragmentation of all cache group partitions on the local node
> > > > >> without concurrent updates.
> > > > >> 2. improved data loading with the data streamer isolation mode [1]. It
> > > > >> seems we should not allow concurrent updates to a cache while we are on
> > > > >> the `fast data load` step.
> > > > >> 3. recovery from a snapshot without cache stop/start actions
> > > > >>
> > > > >>
> > > > >> [1] https://issues.apache.org/jira/browse/IGNITE-11793
> > > > >>
> > > > >> On Thu, 3 Oct 2019 at 22:50, Sergey Kozlov <skozlov@gridgain.com>
> > > > wrote:
> > > > >> >
> > > > >> > Hi
> > > > >> >
> > > > >> > I'm not sure that taking a node offline is the best way to do that.
> > > > >> > Cons:
> > > > >> >  - different caches may have different fragmentation, but we force the
> > > > >> > whole node to stop
> > > > >> >  - taking a node offline is a maintenance operation that will require
> > > > >> > adding +1 backup to reduce the risk of data loss
> > > > >> >  - baseline auto adjustment?
> > > > >> >  - impact on index rebuild?
> > > > >> >  - cache configuration changes (or destroy) while the node is offline
> > > > >> >
> > > > >> > What about other ways without a node stop? E.g. take a cache group on a
> > > > >> > node offline? Add a *defrag <cache_group>* command to control.sh to
> > > > >> > force an internal rebalance in the node, with an expected impact on
> > > > >> > performance.
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Thu, Oct 3, 2019 at 12:08 PM Anton Vinogradov <av@apache.org> wrote:
> > > > >> >
> > > > >> > > Alexey,
> > > > >> > > As for me, it does not matter whether it will be an IEP, an umbrella
> > > > >> > > or a single issue.
> > > > >> > > The most important thing is the Assignee :)
> > > > >> > >
> > > > >> > > On Thu, Oct 3, 2019 at 11:59 AM Alexey Goncharuk <
> > > > >> > > alexey.goncharuk@gmail.com>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Anton, do you think we should file a single ticket for this or
> > > > >> > > > should we go with an IEP? As of now, the change does not look big
> > > > >> > > > enough for an IEP to me.
> > > > >> > > >
> > > > >> > > > Thu, Oct 3, 2019, 11:18, Anton Vinogradov <av@apache.org>:
> > > > >> > > >
> > > > >> > > > > Alexey,
> > > > >> > > > >
> > > > >> > > > > Sounds good to me.
> > > > >> > > > >
> > > > >> > > > > On Thu, Oct 3, 2019 at 10:51 AM Alexey Goncharuk <
> > > > >> > > > > alexey.goncharuk@gmail.com> wrote:
> > > > >> > > > >
> > > > >> > > > > > Anton,
> > > > >> > > > > >
> > > > >> > > > > > Switching a partition to and from the SHRINKING state will
> > > > >> > > > > > require intricate synchronizations in order to properly
> > > > >> > > > > > determine the start position for historical rebalance without
> > > > >> > > > > > PME.
> > > > >> > > > > >
> > > > >> > > > > > I would still go with an offline-node approach, but instead of
> > > > >> > > > > > cleaning the persistence, we can do effective defragmentation
> > > > >> > > > > > when the node is offline because we are sure that there is no
> > > > >> > > > > > concurrent load. After the defragmentation completes, we bring
> > > > >> > > > > > the node back to the cluster and historical rebalance will kick
> > > > >> > > > > > in automatically. It will still require manual node restarts,
> > > > >> > > > > > but since the data is not removed, there are no additional
> > > > >> > > > > > risks. Also, this will be an excellent solution for those who
> > > > >> > > > > > can afford downtime and execute the defragment command on all
> > > > >> > > > > > nodes in the cluster simultaneously - this will be the fastest
> > > > >> > > > > > way possible.
> > > > >> > > > > >
> > > > >> > > > > > --AG
> > > > >> > > > > >
> > > > >> > > > > > Mon, Sep 30, 2019, 09:29, Anton Vinogradov <av@apache.org>:
> > > > >> > > > > >
> > > > >> > > > > > > Alexei,
> > > > >> > > > > > > >> stopping fragmented node and removing partition data, then
> > > > >> > > > > > > >> starting it again
> > > > >> > > > > > >
> > > > >> > > > > > > That's exactly what we're doing to solve the fragmentation
> > > > >> > > > > > > issue.
> > > > >> > > > > > > The problem here is that we have to perform N/B
> > > > >> > > > > > > restart-rebalance operations (N - cluster size, B - backups
> > > > >> > > > > > > count), and it takes a lot of time with a risk of losing the
> > > > >> > > > > > > data.
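For illustration only (the counts below are assumed, not taken from the thread): with N = 12 nodes and B = 2 backups, at most 2 nodes can have their partition data cleared at a time without dropping the last copy of any partition, so the clean-and-rebalance cycle has to be repeated roughly 12 / 2 = 6 times, with each round waiting for a full rebalance to finish.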
> > > > >> > > > > > >
> > > > >> > > > > > > On Fri, Sep 27, 2019 at 5:49 PM Alexei Scherbakov <
> > > > >> > > > > > > alexey.scherbakoff@gmail.com> wrote:
> > > > >> > > > > > >
> > > > >> > > > > > > > Probably this should be allowed via a public API; actually,
> > > > >> > > > > > > > this is the same as manual rebalancing.
> > > > >> > > > > > > >
> > > > >> > > > > > > > Fri, Sep 27, 2019, 17:40, Alexei Scherbakov <alexey.scherbakoff@gmail.com>:
> > > > >> > > > > > > >
> > > > >> > > > > > > > > The poor man's solution for the problem would be stopping
> > > > >> > > > > > > > > the fragmented node and removing partition data, then
> > > > >> > > > > > > > > starting it again, allowing a full state transfer already
> > > > >> > > > > > > > > without deletes.
> > > > >> > > > > > > > > Rinse and repeat for all owners.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Anton Vinogradov, would this work for you as a workaround?
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Thu, Sep 19, 2019, 13:03, Anton Vinogradov <av@apache.org>:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >> Alexey,
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> Let's combine your and Ivan's proposals.
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> >> vacuum command, which acquires exclusive table lock,
> > > > >> > > > > > > > >> >> so no concurrent activities on the table are possible.
> > > > >> > > > > > > > >> and
> > > > >> > > > > > > > >> >> Could the problem be solved by stopping a node which
> > > > >> > > > > > > > >> >> needs to be defragmented, clearing persistence files
> > > > >> > > > > > > > >> >> and restarting the node?
> > > > >> > > > > > > > >> >> After rebalancing the node will receive all data back
> > > > >> > > > > > > > >> >> without fragmentation.
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> How about having a special partition state, SHRINKING?
> > > > >> > > > > > > > >> This state should mean that the partition is unavailable
> > > > >> > > > > > > > >> for reads and updates but keeps its update counters and
> > > > >> > > > > > > > >> is not marked as lost, renting or evicted.
> > > > >> > > > > > > > >> In this state we are able to iterate over the partition
> > > > >> > > > > > > > >> and apply its entries to another file in a compact way.
> > > > >> > > > > > > > >> Indices should be updated during the copy-on-shrink
> > > > >> > > > > > > > >> procedure or at shrink completion.
> > > > >> > > > > > > > >> Once the shrunk file is ready we should replace the
> > > > >> > > > > > > > >> original partition file with it and mark it as MOVING,
> > > > >> > > > > > > > >> which will start the historical rebalance.
> > > > >> > > > > > > > >> Shrinking should be performed during low-activity
> > > > >> > > > > > > > >> periods, but even in case we find that activity was high
> > > > >> > > > > > > > >> and historical rebalance is not suitable, we may just
> > > > >> > > > > > > > >> remove the file and use regular rebalance to restore the
> > > > >> > > > > > > > >> partition (this will also lead to a shrink).
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> BTW, it seems we are able to implement partition shrink
> > > > >> > > > > > > > >> in a cheap way.
> > > > >> > > > > > > > >> We may just use the rebalancing code to apply the fat
> > > > >> > > > > > > > >> partition's entries to the new file.
> > > > >> > > > > > > > >> So, 3 stages here: local rebalance, index update and
> > > > >> > > > > > > > >> global historical rebalance.
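A rough Java sketch of the SHRINKING flow described above; every type and method name (PartitionShrinker, PartitionView, etc.) is hypothetical and only illustrates the order of steps, not an existing Ignite API:

    /** Sketch of the copy-on-shrink procedure for a single partition. */
    public class PartitionShrinker {
        /** Hypothetical view of a local partition and its on-disk file. */
        public interface PartitionView {
            void setState(String state);            // e.g. "SHRINKING", "MOVING"
            Iterable<byte[]> entries();             // iterate live entries only
            void appendToNewFile(byte[] entry);     // write an entry to the compact file
            void updateIndexes();                   // rebuild/patch indexes for moved entries
            void swapInNewFile();                   // atomically replace the old partition file
        }

        public void shrink(PartitionView part) {
            // 1. Block reads and updates, but keep update counters so that
            //    historical (WAL-based) rebalance can catch up afterwards.
            part.setState("SHRINKING");

            // 2. Copy live entries into a fresh file, leaving the old free-space holes behind.
            for (byte[] entry : part.entries())
                part.appendToNewFile(entry);

            // 3. Update indexes either during the copy or once at completion.
            part.updateIndexes();

            // 4. Replace the original partition file and let historical rebalance
            //    re-apply the updates that arrived while the partition was shrinking.
            part.swapInNewFile();
            part.setState("MOVING");
        }
    }

If too many updates accumulate while the partition is SHRINKING, the fallback mentioned above still applies: drop the file and let a regular full rebalance restore it.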
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> On Thu, Sep 19, 2019 at 11:43 AM Alexey Goncharuk <
> > > > >> > > > > > > > >> alexey.goncharuk@gmail.com> wrote:
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >> > Anton,
> > > > >> > > > > > > > >> >
> > > > >> > > > > > > > >> > > >> The solution which Anton suggested does not look
> > > > >> > > > > > > > >> > > >> easy because it will most likely significantly
> > > > >> > > > > > > > >> > > >> hurt performance
> > > > >> > > > > > > > >> > > Mostly agree here, but what drop do we expect? What
> > > > >> > > > > > > > >> > > price are we ready to pay?
> > > > >> > > > > > > > >> > > Not sure, but it seems some vendors are ready to pay,
> > > > >> > > > > > > > >> > > for example, a 5% drop for this.
> > > > >> > > > > > > > >> >
> > > > >> > > > > > > > >> > 5% may be a big drop for some use-cases, so I think we
> > > > >> > > > > > > > >> > should look at how to improve performance, not how to
> > > > >> > > > > > > > >> > make it worse.
> > > > >> > > > > > > > >> >
> > > > >> > > > > > > > >> > > >> it is hard to maintain a data structure to choose
> > > > >> > > > > > > > >> > > >> "page from free-list with enough space closest to
> > > > >> > > > > > > > >> > > >> the beginning of the file".
> > > > >> > > > > > > > >> > > We can just split each free-list bucket into a couple
> > > > >> > > > > > > > >> > > and use the first for pages in the first half of the
> > > > >> > > > > > > > >> > > file and the second for the last.
> > > > >> > > > > > > > >> > > Only two buckets are required here since, during the
> > > > >> > > > > > > > >> > > file shrink, the first bucket's window will be shrunk
> > > > >> > > > > > > > >> > > too.
> > > > >> > > > > > > > >> > > It seems this gives us the same price on put: just
> > > > >> > > > > > > > >> > > use the first bucket in case it's not empty.
> > > > >> > > > > > > > >> > > The remove price (with merge) will be increased, of
> > > > >> > > > > > > > >> > > course.
> > > > >> > > > > > > > >> > >
> > > > >> > > > > > > > >> > > The compromise solution is to have a priority put (to
> > > > >> > > > > > > > >> > > the first part of the file), keeping removal as is,
> > > > >> > > > > > > > >> > > and a schedulable per-page migration for the rest of
> > > > >> > > > > > > > >> > > the data during the low-activity period.
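A small Java sketch of the two-bucket free-list idea quoted above; the class below is illustrative only (Ignite's real free lists are page-based structures, not in-memory deques). The point is that a put prefers pages from the first half of the file, so the tail of the file gradually empties:

    import java.util.ArrayDeque;
    import java.util.Deque;

    /** Sketch: one free-list size bucket split by file position. */
    public class SplitFreeListBucket {
        private final Deque<Long> firstHalfPages = new ArrayDeque<>(); // pages in the first half of the file
        private final Deque<Long> lastHalfPages = new ArrayDeque<>();  // pages in the last half of the file

        /** Page index separating the two halves; moves down when the file shrinks. */
        private long halfBoundary;

        public SplitFreeListBucket(long halfBoundary) {
            this.halfBoundary = halfBoundary;
        }

        /** Register a page that has enough free space for this bucket's size class. */
        public void addFreePage(long pageIdx) {
            (pageIdx < halfBoundary ? firstHalfPages : lastHalfPages).push(pageIdx);
        }

        /** Take a page for a put: same cost as a single bucket, just prefer the first half. */
        public Long takePageForPut() {
            if (!firstHalfPages.isEmpty())
                return firstHalfPages.pop();

            return lastHalfPages.isEmpty() ? null : lastHalfPages.pop();
        }

        /** On a file shrink the boundary moves down (re-bucketing of existing pages is omitted here). */
        public void onFileShrunk(long newBoundary) {
            halfBoundary = newBoundary;
        }
    }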
> > > > >> > > > > > > > >> > >
> > > > >> > > > > > > > >> > Free lists are large and slow by themselves, it is
> > > > >> > > > > > > > >> > expensive to checkpoint and read them on start, so as a
> > > > >> > > > > > > > >> > long-term solution I would look into removing them.
> > > > >> > > > > > > > >> > Moreover, not sure if adding yet another background
> > > > >> > > > > > > > >> > process will improve the codebase reliability and
> > > > >> > > > > > > > >> > simplicity.
> > > > >> > > > > > > > >> >
> > > > >> > > > > > > > >> > If we want to go the hard path, I would look at a free
> > > > >> > > > > > > > >> > page tracking bitmap - a special bitmask page, where
> > > > >> > > > > > > > >> > each page in an adjacent block is marked as 0 if it has
> > > > >> > > > > > > > >> > free space more than a certain configurable threshold
> > > > >> > > > > > > > >> > (say, 80%) - free, and 1 if less (full). Some vendors
> > > > >> > > > > > > > >> > have successfully implemented this approach, which
> > > > >> > > > > > > > >> > looks much more promising, but harder to implement.
> > > > >> > > > > > > > >> >
> > > > >> > > > > > > > >> > --AG
> > > > >> > > > > > > > >> >
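A compact Java sketch of the free-page-tracking bitmap mentioned above: one bit per page in an adjacent block, 0 meaning the page still has more free space than the configured threshold, 1 meaning it is effectively full. All names here are illustrative, not Ignite internals:

    import java.util.BitSet;

    /** Sketch: one tracking structure covering a block of adjacent data pages. */
    public class FreePageBitmap {
        private final BitSet bits;        // bit i == 1 -> page i of the block is "full"
        private final int pagesPerBlock;  // how many data pages one bitmap covers
        private final double threshold;   // e.g. 0.8: less than 80% free space counts as full

        public FreePageBitmap(int pagesPerBlock, double threshold) {
            this.bits = new BitSet(pagesPerBlock);
            this.pagesPerBlock = pagesPerBlock;
            this.threshold = threshold;
        }

        /** Update the bit after a put or remove changed the page's fill factor. */
        public void onPageChanged(int pageIdx, double freeSpaceFraction) {
            bits.set(pageIdx, freeSpaceFraction < threshold);
        }

        /** Find the first page in the block that still has room, or -1 if the block is full. */
        public int firstFreePage() {
            int idx = bits.nextClearBit(0);

            return idx < pagesPerBlock ? idx : -1;
        }
    }

Unlike a free list, such a bitmap is cheap to checkpoint and to scan on start, which is what makes it attractive as a longer-term replacement.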
> > > > >> > > > > > > > >>
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > --
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Best regards,
> > > > >> > > > > > > > > Alexei Scherbakov
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > > > --
> > > > >> > > > > > > >
> > > > >> > > > > > > > Best regards,
> > > > >> > > > > > > > Alexei Scherbakov
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Sergey Kozlov
> > > > >> > GridGain Systems
> > > > >> > www.gridgain.com
> > > > >>
> > > > >
> > > >
> > >
> >
>


-- 
Sergey Kozlov
GridGain Systems
www.gridgain.com
