drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4272) When sort runs out of memory and query fails, resources are seemingly not freed
Date Tue, 28 Mar 2017 23:25:41 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946183#comment-15946183
] 

Paul Rogers commented on DRILL-4272:
------------------------------------

Bug was filed in Jan. of 2016 against 1.5. Did you try that version? Else, we can assume that
the bug has since been fixed.

The important point is that the problem does not occur in the current version.

We can change the resolution to "not a bug" or "can'r reproduce" if that is more descriptive...

> When sort runs out of memory and query fails, resources are seemingly not freed
> -------------------------------------------------------------------------------
>
>                 Key: DRILL-4272
>                 URL: https://issues.apache.org/jira/browse/DRILL-4272
>             Project: Apache Drill
>          Issue Type: Sub-task
>          Components: Execution - Relational Operators
>    Affects Versions: 1.5.0
>            Reporter: Victoria Markman
>            Assignee: Paul Rogers
>            Priority: Critical
>             Fix For: 1.10.0
>
>
> Executed query11.sql from resources/Advanced/tpcds/tpcds_sf1/original/parquet
> Query runs out of memory:
> {code}
> Error: RESOURCE ERROR: One or more nodes ran out of memory while executing the query.
> Unable to allocate sv2 for 32768 records, and not enough batchGroups to spill.
> batchGroups.size 1
> spilledBatchGroups.size 0
> allocated memory 19961472
> allocator limit 20000000
> Fragment 19:0
> [Error Id: 87aa32b8-17eb-488e-90cb-5f5bffff9aec on atsqa4-133.qa.lab:31010] (state=,code=0)
> {code}
> And leaves fragments running, holding resources:
> {code}
> 2016-01-14 22:46:32,435 [Drillbit-ShutdownHook#0] INFO  o.apache.drill.exec.server.Drillbit
- Received shutdown request.
> 2016-01-14 22:46:32,546 [Curator-ServiceCache-0] WARN  o.a.d.e.w.fragment.FragmentExecutor
- Foreman atsqa4-136.qa.lab no longer active.  Cancelling fragment 2967db08-cd38-925a-4960-9e881f537af8:19:0.
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] INFO  o.a.d.e.w.fragment.FragmentExecutor
- 2967db08-cd38-925a-4960-9e881f537af8:19:0: State change requested CANCELLATION_REQUESTED
--> CANCELLATION_REQUESTED
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN  o.a.d.e.w.fragment.FragmentExecutor
- 2967db08-cd38-925a-4960-9e881f537af8:19:0: Ignoring unexpected state transition CANCELLATION_REQUESTED
--> CANCELLATION_REQUESTED
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN  o.a.d.e.w.fragment.FragmentExecutor
- Foreman atsqa4-136.qa.lab no longer active.  Cancelling fragment 2967db08-cd38-925a-4960-9e881f537af8:17:0.
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] INFO  o.a.d.e.w.fragment.FragmentExecutor
- 2967db08-cd38-925a-4960-9e881f537af8:17:0: State change requested CANCELLATION_REQUESTED
--> CANCELLATION_REQUESTED
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN  o.a.d.e.w.fragment.FragmentExecutor
- 2967db08-cd38-925a-4960-9e881f537af8:17:0: Ignoring unexpected state transition CANCELLATION_REQUESTED
--> CANCELLATION_REQUESTED
> 2016-01-14 22:46:33,563 [BitServer-1] INFO  o.a.d.exec.rpc.control.ControlClient - Channel
closed /10.10.88.134:59069 <--> atsqa4-136.qa.lab/10.10.88.136:31011.
> 2016-01-14 22:46:33,563 [BitClient-1] INFO  o.a.drill.exec.rpc.data.DataClient - Channel
closed /10.10.88.134:34802 <--> atsqa4-136.qa.lab/10.10.88.136:31012.
> 2016-01-14 22:46:33,590 [BitClient-1] INFO  o.a.drill.exec.rpc.data.DataClient - Channel
closed /10.10.88.134:36937 <--> atsqa4-135.qa.lab/10.10.88.135:31012.
> 2016-01-14 22:46:33,595 [BitClient-1] INFO  o.a.drill.exec.rpc.data.DataClient - Channel
closed /10.10.88.134:53860 <--> atsqa4-133.qa.lab/10.10.88.133:31012.
> 2016-01-14 22:46:38,467 [BitClient-1] INFO  o.a.drill.exec.rpc.data.DataClient - Channel
closed /10.10.88.134:48276 <--> atsqa4-134.qa.lab/10.10.88.134:31012.
> 2016-01-14 22:46:39,470 [pool-6-thread-1] INFO  o.a.drill.exec.rpc.user.UserServer -
closed eventLoopGroup io.netty.channel.nio.NioEventLoopGroup@6fb32dfb in 1003 ms
> 2016-01-14 22:46:39,470 [pool-6-thread-2] INFO  o.a.drill.exec.rpc.data.DataServer -
closed eventLoopGroup io.netty.channel.nio.NioEventLoopGroup@5c93dd80 in 1003 ms
> 2016-01-14 22:46:39,470 [pool-6-thread-1] INFO  o.a.drill.exec.service.ServiceEngine
- closed userServer in 1004 ms
> 2016-01-14 22:46:39,470 [pool-6-thread-2] INFO  o.a.drill.exec.service.ServiceEngine
- closed dataPool in 1005 ms
> 2016-01-14 22:46:39,483 [Drillbit-ShutdownHook#0] WARN  o.apache.drill.exec.work.WorkManager
- Closing WorkManager but there are 2 running fragments.
> 2016-01-14 22:46:41,489 [Drillbit-ShutdownHook#0] ERROR o.a.d.exec.server.BootStrapContext
- Pool did not terminate
> 2016-01-14 22:46:41,498 [Drillbit-ShutdownHook#0] WARN  o.apache.drill.exec.server.Drillbit
- Failure on close()
> java.lang.RuntimeException: Exception while closing
>         at org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:46)
~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.server.BootStrapContext.close(BootStrapContext.java:127)
~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:76) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:64) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.close(Drillbit.java:149) ~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit$ShutdownThread.run(Drillbit.java:243)
[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> Caused by: java.util.ConcurrentModificationException: null
>         at java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex(IdentityHashMap.java:732)
~[na:1.7.0_71]
>         at java.util.IdentityHashMap$KeyIterator.next(IdentityHashMap.java:822) ~[na:1.7.0_71]
>         at org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:683) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:679) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:679) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.memory.BaseAllocator.toString(BaseAllocator.java:483)
~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.memory.BaseAllocator.close(BaseAllocator.java:431) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:44)
~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         ... 5 common frames omitted
> 2016-01-14 22:46:41,499 [Drillbit-ShutdownHook#0] INFO  o.apache.drill.exec.server.Drillbit
- Shutdown completed (9062 ms).
> {code}
> or this stack trace:
> {code}
> 2016-01-15 00:59:09,699 [Drillbit-ShutdownHook#0] WARN  o.apache.drill.exec.work.WorkManager
- Closing WorkManager but there are 1 running fragments.
> 2016-01-15 00:59:11,704 [Drillbit-ShutdownHook#0] ERROR o.a.d.exec.server.BootStrapContext
- Pool did not terminate
> 2016-01-15 00:59:11,708 [Drillbit-ShutdownHook#0] WARN  o.apache.drill.exec.server.Drillbit
- Failure on close()
> java.lang.RuntimeException: Exception while closing
>         at org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:46)
~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.server.BootStrapContext.close(BootStrapContext.java:127)
~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:76) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:64) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.close(Drillbit.java:149) ~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit$ShutdownThread.run(Drillbit.java:243)
[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> Caused by: java.lang.IllegalStateException: allocator[frag:19:0]: buffer space (0) +
prealloc space (0) + child space (111759688) != allocated (111104328)
>         at org.apache.drill.exec.memory.BaseAllocator.verifyAllocator(BaseAllocator.java:653)
~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.memory.BaseAllocator.verifyAllocator(BaseAllocator.java:557)
~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.memory.BaseAllocator.verifyAllocator(BaseAllocator.java:528)
~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.memory.BaseAllocator.close(BaseAllocator.java:419) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:44)
~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         ... 5 common frames omitted
> 2016-01-15 00:59:11,708 [Drillbit-ShutdownHook#0] INFO  o.apache.drill.exec.server.Drillbit
- Shutdown completed (9058 ms).
> {code}
> Fragment 19:0 is sort that failed with OOM:
> {code}
> 2016-01-15 00:58:00,552 [2967bc04-43e6-6e2c-e91f-4401daae16cc:frag:19:0] INFO  o.a.d.e.p.i.xsort.ExternalSortBatch
- User Error Occurred
> org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: One or more nodes ran
out of memory while executing the query.
> {code}
> My setup (I'm sure it can be scaled down to one node)
> * 4 node cluster
> * each box is 32 cores
> * 48 GB direct memory
> * 10GB memory allocated to sort
> Drill settings:
> {code}
> 0: jdbc:drill:schema=dfs> select * from sys.options where status like '%CHANGED%';
> +-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+
> |                   name                    |   kind   |  type   |  status  |   num_val
   | string_val  | bool_val  | float_val  |
> +-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+
> | planner.enable_decimal_data_type          | BOOLEAN  | SYSTEM  | CHANGED  | null  
      | null        | true      | null       |
> | planner.enable_hashjoin                   | BOOLEAN  | SYSTEM  | CHANGED  | null  
      | null        | false     | null       |
> | planner.memory.max_query_memory_per_node  | LONG     | SYSTEM  | CHANGED  | 10737418240
 | null        | null      | null       |
> +-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+
> 3 rows selected (3.532 seconds)
> {code}
> Or maybe we don't wait for anything to complete and just shut everything down ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message