drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5513) Managed External Sort : OOM error during the merge phase
Date Mon, 19 Jun 2017 07:32:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16053554#comment-16053554 ]

Paul Rogers commented on DRILL-5513:
------------------------------------

The problem now is that the amount of memory needed to read a spilled batch is significantly
larger than the amount of memory the batch consumed when written. From the (revised) logs:

{code}
ExternalSortBatch - Merge batch size: net = 6666666 bytes,
   gross = 8888888 bytes, records = 35460;
   spill file size: 268435456 bytes
...
PriorityQueueCopierWrapper - Merged 35460 records,
   consuming 7536640 bytes of memory
...
BatchGroup - Read 35460 records; size = 9437184
{code}

The sort decided that a spill batch should contain 6,666,666 bytes of data. It assumed vectors
are 75% full (25% internal fragmentation), giving a memory target of 8,888,888 allocated bytes.
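
For reference, that sizing arithmetic boils down to something like the following (a minimal sketch with made-up names, not the actual ExternalSortBatch code):

{code}
// Hypothetical sketch of the spill-batch sizing assumption described above;
// the class and constant names are invented for illustration.
public class SpillBatchSizing {
  // Vectors are assumed to be 75% full, i.e. 25% internal fragmentation.
  static final double ASSUMED_FULLNESS = 0.75;

  public static void main(String[] args) {
    long netTarget = 6_666_666L;                         // data bytes per spill batch
    long grossTarget = (long) (netTarget / ASSUMED_FULLNESS);
    System.out.println("gross target = " + grossTarget); // 8,888,888
  }
}
{code}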

The actual spill came close to that: allocated memory was 7,536,640 bytes.

But then the assumptions fell apart on re-reading this same data: the read needed 9,437,184
bytes of memory. Since this is much more than expected, the operator memory allocator killed
us for an excessive allocation. (Sigh... Is this really a constructive use of time?)
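
Putting a number on the mismatch, using only the figures from the log excerpt above:

{code}
// Back-of-the-envelope from the logged values: how much larger is the spill
// batch on re-read than the memory it consumed when written?
public class ReadInflation {
  public static void main(String[] args) {
    long writtenBytes = 7_536_640L;  // "consuming 7536640 bytes of memory"
    long readBytes    = 9_437_184L;  // "Read 35460 records; size = 9437184"
    double inflation = (double) readBytes / writtenBytes;
    System.out.printf("re-read inflation factor = %.3f%n", inflation); // ~1.252
  }
}
{code}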

Furthermore, since we have only 20 MB of memory, re-reading two spilled batches consumes 18,874,368
bytes, leaving only 1,125,632 bytes to hold an output batch that is supposed to be 6,666,668 bytes
(net) or 8,888,888 bytes gross.
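
Spelling out that budget with the same logged values:

{code}
// Merge-phase memory budget, using only the numbers quoted above.
public class MergeBudget {
  public static void main(String[] args) {
    long operatorLimit = 20_000_000L;  // roughly 20 MB operator memory budget
    long perReadBatch  = 9_437_184L;   // actual memory to re-read one spilled batch
    int  readBatches   = 2;            // two spilled runs being merged
    long outputGross   = 8_888_888L;   // gross size planned for the output batch

    long leftForOutput = operatorLimit - (long) readBatches * perReadBatch;
    System.out.println("left for output batch = " + leftForOutput);     // 1,125,632
    System.out.println("shortfall = " + (outputGross - leftForOutput)); // 7,763,256
  }
}
{code}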

We have to come up with some way to anticipate the unexpectedly high memory use and reduce
the spill batch size accordingly. (Or track down the reason for the excessive use and fix it...)
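
A rough illustration of the first option, just to show its shape; the factor and names here are invented for illustration and not an existing Drill API:

{code}
// Illustrative only: derate the spill-batch memory target by a factor that
// anticipates the extra memory needed when the batch is read back.
public class SpillBatchPlanner {
  // In this run, read size / write size was roughly 1.25. A real fix would
  // measure this per query, or derive it from allocator rounding behavior,
  // rather than hard-code it.
  static final double READ_INFLATION = 1.25;

  static long derateSpillBatchTarget(long grossTargetBytes) {
    // Leave headroom so re-reading the batch still fits the memory budget.
    return (long) (grossTargetBytes / READ_INFLATION);
  }

  public static void main(String[] args) {
    System.out.println(derateSpillBatchTarget(8_888_888L)); // 7,111,110
  }
}
{code}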

> Managed External Sort : OOM error during the merge phase
> --------------------------------------------------------
>
>                 Key: DRILL-5513
>                 URL: https://issues.apache.org/jira/browse/DRILL-5513
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>            Assignee: Paul Rogers
>         Attachments: 26e5f7b8-71e8-afca-e72e-fad7be2b2416.sys.drill, drillbit.log
>
>
> git.commit.id.abbrev=1e0a14c
> No of nodes in cluster : 1
> DRILL_MAX_DIRECT_MEMORY="32G"
> DRILL_MAX_HEAP="4G"
> The query below fails with an OOM
> {code}
> ALTER SESSION SET `exec.sort.disable_managed` = false;
> alter session set `planner.width.max_per_query` = 100;
> alter session set `planner.memory.max_query_memory_per_node` = 652428800;
> select count(*) from
>   (select s1.type type, flatten(s1.rms.rptd) rptds from
>     (select d.type type, d.uid uid, flatten(d.map.rm) rms
>      from dfs.`/drill/testdata/resource-manager/nested-large.json` d
>      order by d.uid) s1
>    order by s1.rms.mapid);
> {code}
> Exception from the logs
> {code}
> 2017-05-15 12:58:46,646 [BitServer-4] DEBUG o.a.drill.exec.work.foreman.Foreman - 26e5f7b8-71e8-afca-e72e-fad7be2b2416: State change requested RUNNING --> FAILED
> org.apache.drill.common.exceptions.UserRemoteException: RESOURCE ERROR: One or more nodes ran out of memory while executing the query.
> Unable to allocate buffer of size 2097152 due to memory limit. Current allocation: 19791880
> Fragment 5:2
> [Error Id: bb67176f-a780-400d-88c9-06fea131ea64 on qa-node190.qa.lab:31010]
>   (org.apache.drill.exec.exception.OutOfMemoryException) Unable to allocate buffer of size 2097152 due to memory limit. Current allocation: 19791880
>     org.apache.drill.exec.memory.BaseAllocator.buffer():220
>     org.apache.drill.exec.memory.BaseAllocator.buffer():195
>     org.apache.drill.exec.vector.BigIntVector.reAlloc():212
>     org.apache.drill.exec.vector.BigIntVector.copyFromSafe():324
>     org.apache.drill.exec.vector.NullableBigIntVector.copyFromSafe():367
>     org.apache.drill.exec.vector.NullableBigIntVector$TransferImpl.copyValueSafe():328
>     org.apache.drill.exec.vector.complex.RepeatedMapVector$RepeatedMapTransferPair.copyValueSafe():360
>     org.apache.drill.exec.vector.complex.MapVector$MapTransferPair.copyValueSafe():220
>     org.apache.drill.exec.vector.complex.MapVector.copyFromSafe():82
>     org.apache.drill.exec.test.generated.PriorityQueueCopierGen4494.doCopy():34
>     org.apache.drill.exec.test.generated.PriorityQueueCopierGen4494.next():76
>     org.apache.drill.exec.physical.impl.xsort.managed.CopierHolder$BatchMerger.next():234
>     org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.mergeSpilledRuns():1214
>     org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load():689
>     org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext():559
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next():215
>     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>     org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93
>     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>     org.apache.drill.exec.physical.impl.validate.IteratorValidatorBatchIterator.next():215
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>     org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():92
>     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():234
>     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():227
>     java.security.AccessController.doPrivileged():-2
>     javax.security.auth.Subject.doAs():415
>     org.apache.hadoop.security.UserGroupInformation.doAs():1595
>     org.apache.drill.exec.work.fragment.FragmentExecutor.run():227
>     org.apache.drill.common.SelfCleaningRunnable.run():38
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1145
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():615
>     java.lang.Thread.run():745
>         at org.apache.drill.exec.work.foreman.QueryManager$1.statusUpdate(QueryManager.java:537) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at org.apache.drill.exec.rpc.control.WorkEventBus.statusUpdate(WorkEventBus.java:71) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at org.apache.drill.exec.work.batch.ControlMessageHandler.handle(ControlMessageHandler.java:94) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at org.apache.drill.exec.work.batch.ControlMessageHandler.handle(ControlMessageHandler.java:55) [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at org.apache.drill.exec.rpc.BasicServer.handle(BasicServer.java:159) [drill-rpc-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at org.apache.drill.exec.rpc.BasicServer.handle(BasicServer.java:53) [drill-rpc-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:274) [drill-rpc-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:244) [drill-rpc-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89) [netty-codec-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.handler.timeout.ReadTimeoutHandler.channelRead(ReadTimeoutHandler.java:150) [netty-handler-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242) [netty-codec-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) [netty-common-4.0.27.Final.jar:4.0.27.Final]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
> {code}
> Attached are the log and profile files. The dataset is not attached here as it is larger than the permitted size.



