ignite-dev mailing list archives

From Alexey Goncharuk <alexey.goncha...@gmail.com>
Subject Re: How properly handle IgniteOOM
Date Thu, 14 Dec 2017 09:30:34 GMT
Mikhail,

Here is the first idea that came to my mind. Before a transaction is
committed (or an atomic update is applied), we have all the entries being
written at hand. We can estimate the maximum amount of memory required for
the write and reserve that memory up front (one AtomicLong CAS).
If we cannot reserve the memory, we throw the OOME early. This way we should
never get into a situation where it is too late to give up.

However, this may not be a very easy task, so we probably need to make a
fast prototype to prove the idea works before we start implementing it
fully.
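
As a rough illustration of the reservation step described above, here is a
minimal sketch, not Ignite code. The class MemoryReservation, its methods, and
the byte limit are all hypothetical names introduced for this example; how the
maximum size of a write is estimated is left out entirely.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical reservation counter; this is not an existing Ignite class. */
final class MemoryReservation {
    /** Assumed upper bound on reservable bytes (e.g. the memory region size). */
    private final long maxBytes;

    /** Bytes currently reserved by in-flight updates. */
    private final AtomicLong reserved = new AtomicLong();

    MemoryReservation(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /**
     * Tries to reserve the estimated number of bytes with a CAS loop.
     * Returns false when the reservation would exceed the limit, so the
     * caller can fail early, before any page has been modified.
     */
    boolean tryReserve(long bytes) {
        if (bytes < 0)
            throw new IllegalArgumentException("bytes must be non-negative");

        while (true) {
            long cur = reserved.get();

            // Written as a subtraction to avoid overflow in cur + bytes.
            if (bytes > maxBytes - cur)
                return false;

            if (reserved.compareAndSet(cur, cur + bytes))
                return true;
            // Another thread changed the counter; re-read and retry.
        }
    }

    /** Returns a previous reservation to the pool. */
    void release(long bytes) {
        reserved.addAndGet(-bytes);
    }

    /** Bytes reserved at the moment of the call. */
    long reserved() {
        return reserved.get();
    }
}
```

Under these assumptions, a caller would invoke tryReserve with the estimated
size before a commit or atomic update, throw the OOME if it returns false, and
call release once the reservation is no longer needed.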

--AG

2017-12-14 12:22 GMT+03:00 Mikhail Cherkasov <mcherkasov@gridgain.com>:

> Hi Denis,
>
> but should we treat the current behavior as a bug that should be fixed ASAP,
> or should we treat it as a known limitation for now?
> Because right now, IgniteOOM means that the whole cluster has to be restarted.
>
> Thanks,
> Mikhail.
>
> On Thu, Dec 14, 2017 at 2:03 AM, Denis Magda <dmagda@apache.org> wrote:
>
> > Hello Mikhail,
> >
> > This problem is related to the discussion around Ignite internal problems
> > and their possible resolution:
> > http://apache-ignite-developers.2346864.n4.nabble.com/Internal-problems-requiring-graceful-node-shutdown-reboot-etc-td24856.html
> >
> > Referring to that discussion, I would define a special IgniteFailureAction
> > in response to IgniteOOM (IgniteFailureCause in terms of the new API). The
> > action can purge, wipe out the page memory or take other extra steps.
> >
> > —
> > Denis
> >
> > > On Dec 13, 2017, at 9:14 AM, Mikhail Cherkasov <mcherkasov@gridgain.com> wrote:
> > >
> > > Hi all,
> > >
> > > > I ran into a problem: if Ignite runs out of memory and IgniteOOM is
> > > > thrown, there is no way to continue working with the cluster.
> > >
> > > > You cannot remove part of the data to free some space, because during
> > > > removal Ignite tries to move pages to a free list, and the free list tries
> > > > to acquire more pages, but there is no space left for that.
> > >
> > > > Ignite cannot revert transactions properly for the same reason.
> > > > If IgniteOOM occurs during a transaction, Ignite will try to revert the
> > > > already applied changes and as a result will move some pages to the free
> > > > list, which hits the same problem as above: there is no space for the
> > > > free list either.
> > >
> > > > And you cannot even add more nodes, because after rebalancing Ignite will
> > > > try to evict pages, which again means we need space for the free list:
> > > > https://issues.apache.org/jira/browse/IGNITE-7019
> > >
> > > > Do you have any ideas on how we can handle this properly?
> > >
> > > --
> > > Thanks,
> > > Mikhail.
> >
> >
>
>
> --
> Thanks,
> Mikhail.
>
