Return-Path: X-Original-To: apmail-drill-issues-archive@minotaur.apache.org Delivered-To: apmail-drill-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 4EB5718B15 for ; Thu, 14 Jan 2016 23:12:40 +0000 (UTC) Received: (qmail 96984 invoked by uid 500); 14 Jan 2016 23:12:40 -0000 Delivered-To: apmail-drill-issues-archive@drill.apache.org Received: (qmail 96850 invoked by uid 500); 14 Jan 2016 23:12:40 -0000 Mailing-List: contact issues-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@drill.apache.org Delivered-To: mailing list issues@drill.apache.org Received: (qmail 96724 invoked by uid 99); 14 Jan 2016 23:12:40 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 14 Jan 2016 23:12:40 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id D5A262C1F5B for ; Thu, 14 Jan 2016 23:12:39 +0000 (UTC) Date: Thu, 14 Jan 2016 23:12:39 +0000 (UTC) From: "Victoria Markman (JIRA)" To: issues@drill.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Created] (DRILL-4272) When sort runs out of memory and query fails, resources are seemingly not freed MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 Victoria Markman created DRILL-4272: --------------------------------------- Summary: When sort runs out of memory and query fails, resources are seemingly not freed Key: DRILL-4272 URL: https://issues.apache.org/jira/browse/DRILL-4272 Project: Apache Drill Issue Type: Bug Components: Execution - Relational Operators Affects Versions: 1.5.0 Reporter: Victoria Markman Priority: Critical Executed query11.sql from resources/Advanced/tpcds/tpcds_sf1/original/parquet Query runs out of memory: {code} Error: RESOURCE ERROR: One or more nodes ran out of memory while executing the query. Unable to allocate sv2 for 32768 records, and not enough batchGroups to spill. batchGroups.size 1 spilledBatchGroups.size 0 allocated memory 19961472 allocator limit 20000000 Fragment 19:0 [Error Id: 87aa32b8-17eb-488e-90cb-5f5bffff9aec on atsqa4-133.qa.lab:31010] (state=,code=0) {code} And leaves fragments running, holding resources: {code} 2016-01-14 22:46:32,435 [Drillbit-ShutdownHook#0] INFO o.apache.drill.exec.server.Drillbit - Received shutdown request. 2016-01-14 22:46:32,546 [Curator-ServiceCache-0] WARN o.a.d.e.w.fragment.FragmentExecutor - Foreman atsqa4-136.qa.lab no longer active. Cancelling fragment 2967db08-cd38-925a-4960-9e881f537af8:19:0. 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] INFO o.a.d.e.w.fragment.FragmentExecutor - 2967db08-cd38-925a-4960-9e881f537af8:19:0: State change requested CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN o.a.d.e.w.fragment.FragmentExecutor - 2967db08-cd38-925a-4960-9e881f537af8:19:0: Ignoring unexpected state transition CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN o.a.d.e.w.fragment.FragmentExecutor - Foreman atsqa4-136.qa.lab no longer active. Cancelling fragment 2967db08-cd38-925a-4960-9e881f537af8:17:0. 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] INFO o.a.d.e.w.fragment.FragmentExecutor - 2967db08-cd38-925a-4960-9e881f537af8:17:0: State change requested CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN o.a.d.e.w.fragment.FragmentExecutor - 2967db08-cd38-925a-4960-9e881f537af8:17:0: Ignoring unexpected state transition CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED 2016-01-14 22:46:33,563 [BitServer-1] INFO o.a.d.exec.rpc.control.ControlClient - Channel closed /10.10.88.134:59069 <--> atsqa4-136.qa.lab/10.10.88.136:31011. 2016-01-14 22:46:33,563 [BitClient-1] INFO o.a.drill.exec.rpc.data.DataClient - Channel closed /10.10.88.134:34802 <--> atsqa4-136.qa.lab/10.10.88.136:31012. 2016-01-14 22:46:33,590 [BitClient-1] INFO o.a.drill.exec.rpc.data.DataClient - Channel closed /10.10.88.134:36937 <--> atsqa4-135.qa.lab/10.10.88.135:31012. 2016-01-14 22:46:33,595 [BitClient-1] INFO o.a.drill.exec.rpc.data.DataClient - Channel closed /10.10.88.134:53860 <--> atsqa4-133.qa.lab/10.10.88.133:31012. 2016-01-14 22:46:38,467 [BitClient-1] INFO o.a.drill.exec.rpc.data.DataClient - Channel closed /10.10.88.134:48276 <--> atsqa4-134.qa.lab/10.10.88.134:31012. 2016-01-14 22:46:39,470 [pool-6-thread-1] INFO o.a.drill.exec.rpc.user.UserServer - closed eventLoopGroup io.netty.channel.nio.NioEventLoopGroup@6fb32dfb in 1003 ms 2016-01-14 22:46:39,470 [pool-6-thread-2] INFO o.a.drill.exec.rpc.data.DataServer - closed eventLoopGroup io.netty.channel.nio.NioEventLoopGroup@5c93dd80 in 1003 ms 2016-01-14 22:46:39,470 [pool-6-thread-1] INFO o.a.drill.exec.service.ServiceEngine - closed userServer in 1004 ms 2016-01-14 22:46:39,470 [pool-6-thread-2] INFO o.a.drill.exec.service.ServiceEngine - closed dataPool in 1005 ms 2016-01-14 22:46:39,483 [Drillbit-ShutdownHook#0] WARN o.apache.drill.exec.work.WorkManager - Closing WorkManager but there are 2 running fragments. 2016-01-14 22:46:41,489 [Drillbit-ShutdownHook#0] ERROR o.a.d.exec.server.BootStrapContext - Pool did not terminate 2016-01-14 22:46:41,498 [Drillbit-ShutdownHook#0] WARN o.apache.drill.exec.server.Drillbit - Failure on close() java.lang.RuntimeException: Exception while closing at org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:46) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] at org.apache.drill.exec.server.BootStrapContext.close(BootStrapContext.java:127) ~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:76) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:64) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] at org.apache.drill.exec.server.Drillbit.close(Drillbit.java:149) ~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] at org.apache.drill.exec.server.Drillbit$ShutdownThread.run(Drillbit.java:243) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] Caused by: java.util.ConcurrentModificationException: null at java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex(IdentityHashMap.java:732) ~[na:1.7.0_71] at java.util.IdentityHashMap$KeyIterator.next(IdentityHashMap.java:822) ~[na:1.7.0_71] at org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:683) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] at org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:679) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] at org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:679) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] at org.apache.drill.exec.memory.BaseAllocator.toString(BaseAllocator.java:483) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] at org.apache.drill.exec.memory.BaseAllocator.close(BaseAllocator.java:431) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] at org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:44) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT] ... 5 common frames omitted 2016-01-14 22:46:41,499 [Drillbit-ShutdownHook#0] INFO o.apache.drill.exec.server.Drillbit - Shutdown completed (9062 ms). {code} My setup (I'm sure it can be scaled down to one node) * 4 node cluster * each box is 32 cores * 48 GB direct memory * 10GB memory allocated to sort Drill settings: {code} 0: jdbc:drill:schema=dfs> select * from sys.options where status like '%CHANGED%'; +-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+ | name | kind | type | status | num_val | string_val | bool_val | float_val | +-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+ | planner.enable_decimal_data_type | BOOLEAN | SYSTEM | CHANGED | null | null | true | null | | planner.enable_hashjoin | BOOLEAN | SYSTEM | CHANGED | null | null | false | null | | planner.memory.max_query_memory_per_node | LONG | SYSTEM | CHANGED | 10737418240 | null | null | null | +-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+ 3 rows selected (3.532 seconds) {code} Or maybe we don't wait for anything to complete and just shut everything down ? -- This message was sent by Atlassian JIRA (v6.3.4#6332)