asterixdb-notifications mailing list archives

From "Michael J. Carey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ASTERIXDB-2145) Recovery process fails on 100 datasets
Date Wed, 25 Oct 2017 21:46:00 GMT

    [ https://issues.apache.org/jira/browse/ASTERIXDB-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219598#comment-16219598 ]

Michael J. Carey commented on ASTERIXDB-2145:
---------------------------------------------

This was a good workaround and will actually be fine for Cloudberry, but the smaller the
components, the worse the write amplification will be.  As has been discussed on a related
e-mail thread, the right solution is for recovery not to try to have all of the datasets active
simultaneously. That is a broken approach to recovery; we should be able to recover regardless
of the number of datasets.  AsterixDB users should be setting their component size parameter
based on their expectations for the data: how much there will be, how they want to trade
off write amplification vs. query efficiency, etc.
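
To illustrate the point about not keeping every dataset active at once, here is a toy
model of log replay under a fixed memory budget. It is only a sketch and not AsterixDB
code: the names replay, BudgetExceeded, and per_dataset_cost are hypothetical, and the
least-recently-used eviction policy is an assumption chosen for simplicity.

```python
from collections import OrderedDict


class BudgetExceeded(Exception):
    """Raised when a dataset cannot be opened within the memory budget."""


def replay(log, budget, per_dataset_cost, evict=True):
    """Replay (dataset_id, record) pairs and return the peak number of
    datasets that held memory at the same time.

    With evict=False every dataset stays open once touched, so replay fails
    as soon as the budget is exhausted. With evict=True the least recently
    used dataset is closed (standing in for "flush and release") first.
    """
    open_datasets = OrderedDict()  # dataset_id -> memory held, in LRU order
    used = 0
    peak = 0
    for dataset_id, _record in log:
        if dataset_id in open_datasets:
            open_datasets.move_to_end(dataset_id)  # mark as recently used
        else:
            while used + per_dataset_cost > budget:
                if not evict or not open_datasets:
                    raise BudgetExceeded(
                        f"cannot allocate dataset {dataset_id}")
                _, freed = open_datasets.popitem(last=False)
                used -= freed
            open_datasets[dataset_id] = per_dataset_cost
            used += per_dataset_cost
        peak = max(peak, len(open_datasets))
    return peak


if __name__ == "__main__":
    # 112 datasets, three log records each, budget for 16 datasets at a time.
    log = [(d, i) for i in range(3) for d in range(112)]
    print("peak open datasets with eviction:", replay(log, 16, 1))
    try:
        replay(log, 16, 1, evict=False)
    except BudgetExceeded as exc:
        print("without eviction:", exc)
```

In this model the number of datasets in the log no longer matters once eviction is
allowed; only the budget bounds how many are open at any moment.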

> Recovery process fails on 100 datasets
> --------------------------------------
>
>                 Key: ASTERIXDB-2145
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2145
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>
> On the Cloudberry DB, there are currently 112 datasets in a dataverse. When restarting
> that instance, the NC showed the following error and stopped.
> java.lang.IllegalStateException: Failed to redo
> at org.apache.asterix.app.nc.RecoveryManager.redo(RecoveryManager.java:712)
> at org.apache.asterix.app.nc.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:378)
> at org.apache.asterix.app.nc.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:187)
> at org.apache.asterix.app.nc.RecoveryManager.startLocalRecovery(RecoveryManager.java:179)
> at org.apache.asterix.app.nc.task.LocalRecoveryTask.perform(LocalRecoveryTask.java:43)
> at org.apache.asterix.app.replication.message.StartupTaskResponseMessage.handle(StartupTaskResponseMessage.java:56)
> at org.apache.asterix.messaging.NCMessageBroker.receivedMessage(NCMessageBroker.java:92)
> at org.apache.hyracks.control.nc.work.ApplicationMessageWork.run(ApplicationMessageWork.java:51)
> at org.apache.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:127)
> Caused by: org.apache.hyracks.api.exceptions.HyracksDataException: Cannot allocate dataset 191 memory since memory budget would be exceeded.
> at org.apache.asterix.common.context.DatasetLifecycleManager.allocateMemory(DatasetLifecycleManager.java:568)
> at org.apache.hyracks.storage.common.buffercache.ResourceHeapBufferAllocator.reserveAllocation(ResourceHeapBufferAllocator.java:53)
> at org.apache.hyracks.storage.am.lsm.common.impls.VirtualBufferCache.open(VirtualBufferCache.java:307)
> at org.apache.hyracks.storage.am.lsm.common.impls.MultitenantVirtualBufferCache.open(MultitenantVirtualBufferCache.java:119)
> at org.apache.hyracks.storage.am.lsm.btree.impls.LSMBTree.allocateMemoryComponent(LSMBTree.java:611)
> at org.apache.hyracks.storage.am.lsm.common.impls.AbstractLSMIndex.allocateMemoryComponents(AbstractLSMIndex.java:389)
> at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.modify(LSMHarness.java:421)
> at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.forceModify(LSMHarness.java:368)
> at org.apache.hyracks.storage.am.lsm.common.impls.LSMTreeIndexAccessor.forceUpsert(LSMTreeIndexAccessor.java:181)
> at org.apache.asterix.app.nc.RecoveryManager.redo(RecoveryManager.java:707)
> ... 8 more
> So, I increased the storage.memorycomponent.globalbudget parameter from 3GB to 5GB. Still,
> the NC showed the following error and the recovery process could not finish.
> ... similar log records ...
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadDataverse
> INFO: Loading dataverse:berry
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadIndex
> INFO: Loading index:meta_idx_meta
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadIndex
> INFO: Resource loaded 161:storage/partition_1/berry/meta_idx_meta
> Oct 25, 2017 9:34:09 AM org.apache.hyracks.util.ExitUtil$ExitThread run
> INFO: JVM exiting with status 2; bye!
> So, I checked the parameter information page and found that the default value of
> storage.memorycomponent.numpages is 1/16 of the global component budget. Therefore, I decreased
> this parameter so that more datasets could fit in memory, and the instance was finally
> able to start. It seems that the recovery process tries to load and keep all datasets
> in memory, and this needs to be checked.
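
The arithmetic behind the workaround can be sketched as follows. This is a simplified
model, not AsterixDB's actual accounting: it assumes each active dataset reserves
exactly one per-dataset share of the global budget, and it ignores page size rounding
and any other overhead. The function names are hypothetical.

```python
GIB = 1024 ** 3


def max_open_datasets(global_budget_bytes, per_dataset_bytes):
    """How many per-dataset shares fit in the global budget."""
    return global_budget_bytes // per_dataset_bytes


def max_per_dataset_bytes(global_budget_bytes, dataset_count):
    """Largest per-dataset share that lets dataset_count datasets coexist."""
    return global_budget_bytes // dataset_count


if __name__ == "__main__":
    for budget_gib in (3, 5):
        budget = budget_gib * GIB
        default_share = budget // 16  # default: 1/16 of the global budget
        print(budget_gib, "GiB budget, default share fits",
              max_open_datasets(budget, default_share), "datasets")
    print("share needed for 112 datasets at 5 GiB:",
          max_per_dataset_bytes(5 * GIB, 112), "bytes")
```

Under these assumptions, a default share that scales with the global budget always
admits the same number of datasets, which would be consistent with the increase from
3GB to 5GB not helping while lowering storage.memorycomponent.numpages did.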



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
