Date: Wed, 25 Oct 2017 21:46:00 +0000 (UTC)
From: "Michael J. Carey (JIRA)"
To: notifications@asterixdb.incubator.apache.org
Reply-To: dev@asterixdb.apache.org
Subject: [jira] [Assigned] (ASTERIXDB-2145) Recovery process fails on 100 datasets

     [ https://issues.apache.org/jira/browse/ASTERIXDB-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael J. Carey reassigned ASTERIXDB-2145:
-------------------------------------------

    Assignee: Ian Maxon

Ian, do you want to take a crack at this one?

> Recovery process fails on 100 datasets
> --------------------------------------
>
>                 Key: ASTERIXDB-2145
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2145
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>            Assignee: Ian Maxon
>
> On the Cloudberry DB, there are currently 112 datasets in a dataverse. When restarting that instance, the NC showed the following error and stopped.
>
> java.lang.IllegalStateException: Failed to redo
> 	at org.apache.asterix.app.nc.RecoveryManager.redo(RecoveryManager.java:712)
> 	at org.apache.asterix.app.nc.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:378)
> 	at org.apache.asterix.app.nc.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:187)
> 	at org.apache.asterix.app.nc.RecoveryManager.startLocalRecovery(RecoveryManager.java:179)
> 	at org.apache.asterix.app.nc.task.LocalRecoveryTask.perform(LocalRecoveryTask.java:43)
> 	at org.apache.asterix.app.replication.message.StartupTaskResponseMessage.handle(StartupTaskResponseMessage.java:56)
> 	at org.apache.asterix.messaging.NCMessageBroker.receivedMessage(NCMessageBroker.java:92)
> 	at org.apache.hyracks.control.nc.work.ApplicationMessageWork.run(ApplicationMessageWork.java:51)
> 	at org.apache.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:127)
> Caused by: org.apache.hyracks.api.exceptions.HyracksDataException: Cannot allocate dataset 191 memory since memory budget would be exceeded.
> 	at org.apache.asterix.common.context.DatasetLifecycleManager.allocateMemory(DatasetLifecycleManager.java:568)
> 	at org.apache.hyracks.storage.common.buffercache.ResourceHeapBufferAllocator.reserveAllocation(ResourceHeapBufferAllocator.java:53)
> 	at org.apache.hyracks.storage.am.lsm.common.impls.VirtualBufferCache.open(VirtualBufferCache.java:307)
> 	at org.apache.hyracks.storage.am.lsm.common.impls.MultitenantVirtualBufferCache.open(MultitenantVirtualBufferCache.java:119)
> 	at org.apache.hyracks.storage.am.lsm.btree.impls.LSMBTree.allocateMemoryComponent(LSMBTree.java:611)
> 	at org.apache.hyracks.storage.am.lsm.common.impls.AbstractLSMIndex.allocateMemoryComponents(AbstractLSMIndex.java:389)
> 	at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.modify(LSMHarness.java:421)
> 	at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.forceModify(LSMHarness.java:368)
> 	at org.apache.hyracks.storage.am.lsm.common.impls.LSMTreeIndexAccessor.forceUpsert(LSMTreeIndexAccessor.java:181)
> 	at org.apache.asterix.app.nc.RecoveryManager.redo(RecoveryManager.java:707)
> 	... 8 more
>
> So, I increased the storage.memorycomponent.globalbudget parameter from 3GB to 5GB. Still, the NC showed the following error and the recovery process could not finish.
>
> ... similar log records ...
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadDataverse
> INFO: Loading dataverse:berry
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadIndex
> INFO: Loading index:meta_idx_meta
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadIndex
> INFO: Resource loaded 161:storage/partition_1/berry/meta_idx_meta
> Oct 25, 2017 9:34:09 AM org.apache.hyracks.util.ExitUtil$ExitThread run
> INFO: JVM exiting with status 2; bye!
>
> So, I checked the parameter information page and found that the default value of storage.memorycomponent.numpages is 1/16 of the global memory component budget. Therefore, I decreased this parameter so that more datasets could fit in memory, and the instance was finally able to start. It seems that the recovery process tries to load all datasets and keep them in memory, and this needs to be checked.
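
The arithmetic behind the report can be sketched as follows. This is a simplified model, not AsterixDB code: it assumes, as the report states, that each dataset's memory component budget defaults to 1/16 of the global budget, and it further assumes that every dataset touched during recovery holds one such budget at the same time. The class and method names are hypothetical, and the model ignores metadata datasets, partitions, and multiple components per index. Under these assumptions the number of datasets that fit stays at 16 regardless of whether the global budget is 3GB or 5GB, which would be consistent with the increase not helping.

```java
// Hypothetical sketch of the budget arithmetic described in the report.
// Not AsterixDB code; the 1/16 default is taken from the report itself.
final class MemoryBudgetSketch {
    static final long GIB = 1024L * 1024 * 1024;
    static final int DEFAULT_FRACTION = 16; // per-dataset share = global / 16 (per the report)

    /** Per-dataset memory component budget when it is a fixed fraction of the global budget. */
    static long perDatasetBudget(long globalBudgetBytes, int fraction) {
        return globalBudgetBytes / fraction;
    }

    /** How many datasets fit at once if each one reserves perDatasetBytes (simplified model). */
    static long maxOpenDatasets(long globalBudgetBytes, long perDatasetBytes) {
        return globalBudgetBytes / perDatasetBytes;
    }

    public static void main(String[] args) {
        for (long global : new long[] {3 * GIB, 5 * GIB}) {
            long perDataset = perDatasetBudget(global, DEFAULT_FRACTION);
            // With the default fraction, the result does not depend on the global budget.
            System.out.println((global / GIB) + " GiB global budget -> "
                    + maxOpenDatasets(global, perDataset) + " datasets fit");
        }
        // Largest per-dataset budget that would let all 112 datasets fit in 5 GiB.
        long needed = (5 * GIB) / 112;
        System.out.println("112 datasets in 5 GiB need <= " + needed + " bytes each");
    }
}
```

In this model, only lowering the per-dataset budget (the effect of decreasing storage.memorycomponent.numpages) raises the number of datasets that fit, which matches the workaround described above.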