ignite-issues mailing list archives

From "Vladimir Pligin (Jira)" <j...@apache.org>
Subject [jira] [Commented] (IGNITE-14248) Handle exceptions in PartitionReservationManager.onDoneAfterTopologyUnlock properly
Date Tue, 01 Jun 2021 13:59:00 GMT

    [ https://issues.apache.org/jira/browse/IGNITE-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355126#comment-17355126 ]

Vladimir Pligin commented on IGNITE-14248:
------------------------------------------

Hi [~slava.koptilin],

It's all good now.

> Handle exceptions in PartitionReservationManager.onDoneAfterTopologyUnlock properly
> -----------------------------------------------------------------------------------
>
>                 Key: IGNITE-14248
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14248
>             Project: Ignite
>          Issue Type: Improvement
>          Components: cache
>    Affects Versions: 2.9.1
>            Reporter: Vladimir Pligin
>            Assignee: Vladimir Pligin
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> If an exception (or even an Error) is thrown inside this method, the node ends up in an unrecoverable state. Here's an example.
>  # An exchange is about to finish, so it's time to invalidate partition reservations.
>  # The exchange thread delegates this to a thread in the management pool.
>  # The management pool tries to allocate a new thread (it may have been idle and therefore empty).
>  # Thread creation fails, for example because a ulimit is reached; the error is java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
>  # The error is logged and no further action is taken.
>  # The partitions stay reserved forever.
> Message:
>  
> {code:java}
> 2021-02-25 05:52:03.242 [exchange-worker-#182] ERROR o.a.i.i.p.q.h.t.PartitionReservationManager - Unexpected exception on start reservations cleanup
> java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
> 	at java.base/java.lang.Thread.start0(Native Method)
> 	at java.base/java.lang.Thread.start(Thread.java:803)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
> 	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
> 	at org.apache.ignite.internal.processors.closure.GridClosureProcessor.runLocal(GridClosureProcessor.java:847)
> 	at org.apache.ignite.internal.processors.query.h2.twostep.PartitionReservationManager.onDoneAfterTopologyUnlock(PartitionReservationManager.java:323)
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:2617)
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:159)
> 	at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:475)
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1064)
> 	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3375)
> 	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3194)
> 	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
> 	at java.base/java.lang.Thread.run(Thread.java:834)
> {code}
>  
>  
> Code of PartitionReservationManager.onDoneAfterTopologyUnlock:
> {code:java}
> @Override public void onDoneAfterTopologyUnlock(final GridDhtPartitionsExchangeFuture fut) {
>         try {
>             // Must not do anything at the exchange thread. Dispatch to the management thread pool.
>             ctx.closure().runLocal(() -> {
>                     AffinityTopologyVersion topVer = ctx.cache().context().exchange()
>                         .lastAffinityChangedTopologyVersion(fut.topologyVersion());
>
>                     reservations.forEach((key, r) -> {
>                         if (r != REPLICATED_RESERVABLE && !F.eq(key.topologyVersion(), topVer)) {
>                             assert r instanceof GridDhtPartitionsReservation;
>
>                             ((GridDhtPartitionsReservation)r).invalidate();
>                         }
>                     });
>                 },
>                 GridIoPolicy.MANAGEMENT_POOL);
>         }
>         catch (Throwable e) {
>             log.error("Unexpected exception on start reservations cleanup", e);
>         }
>     }
> {code}
>  
>  
> I see two basic approaches:
>  * kill the node (it's already non-functional at this point); this looks like a job for the failure handler (FH).
>  * try to recover somehow (to be honest, it's not clear how exactly).
> This particular OOM situation in fact seems unrecoverable; it's an environment misconfiguration. It would be great to investigate whether potentially recoverable exceptions can be raised inside this block.
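
For illustration only, here is a minimal sketch of the first approach: escalate a failed dispatch to a failure handler instead of only logging it. It uses plain java.util.concurrent types. The class ReservationCleanupSketch, the method dispatchCleanup and the Consumer<Throwable> failure handler are hypothetical stand-ins, not Ignite APIs, and the throwing Executor merely simulates a pool that cannot create a native thread.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names exist in Ignite. */
public class ReservationCleanupSketch {
    /** Stand-in for the management pool. */
    private final Executor managementPool;

    /** Stand-in for a failure handler that would stop the node. */
    private final Consumer<Throwable> failureHandler;

    public ReservationCleanupSketch(Executor managementPool, Consumer<Throwable> failureHandler) {
        this.managementPool = managementPool;
        this.failureHandler = failureHandler;
    }

    /**
     * Dispatches the cleanup task to the pool.
     *
     * @return true if the task was handed over, false if dispatch failed
     *         and the failure was escalated.
     */
    public boolean dispatchCleanup(Runnable cleanup) {
        try {
            managementPool.execute(cleanup);

            return true;
        }
        catch (Throwable e) {
            // The cleanup never ran, so reservations would stay in place.
            // Escalate instead of only logging the error.
            failureHandler.accept(e);

            return false;
        }
    }

    public static void main(String[] args) {
        // Simulated pool that fails the way the stack trace above does.
        Executor failingPool = task -> {
            throw new OutOfMemoryError("unable to create native thread");
        };

        AtomicReference<Throwable> escalated = new AtomicReference<>();

        ReservationCleanupSketch sketch = new ReservationCleanupSketch(failingPool, escalated::set);

        boolean dispatched = sketch.dispatchCleanup(() -> { /* invalidate reservations here */ });

        System.out.println("dispatched=" + dispatched + ", escalated=" + escalated.get());
    }
}
```

Whether the handler should stop the node or attempt recovery is exactly the open question above; the sketch only shows where the escalation hook would sit.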



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
