impala-issues mailing list archives

From "Michael Ho (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (IMPALA-5419) PhjBuilder::Partition::InsertBatch () continue to make progress even after query cancellation
Date Fri, 02 Jun 2017 17:32:04 GMT

     [ https://issues.apache.org/jira/browse/IMPALA-5419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Ho resolved IMPALA-5419.
--------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.9.0

https://github.com/apache/incubator-impala/commit/cde19ab8c7801436070ce0438e28d5042265dfd1

IMPALA-5419: Check for cancellation when building hash tables
When inserting build rows into the hash table for each partition
in PhjBuilder::Partition::BuildHashTable(), we didn't check for
cancellation. This may lead to orphaned fragments running for an
excessive amount of time and consuming a lot of memory after the
query has been cancelled.

This change fixes the problem by checking for cancellation per
row batch in the loop inside PhjBuilder::Partition::BuildHashTable().

Manually verified with the following query that cancelling the query
right after the scan node has returned all build rows does not cause
memory usage to keep climbing.

select /* +straight_join */ count(*)
from tpch20_parquet.lineitem t1
  join /* +broadcast */ tpch20_parquet.lineitem t2
  on t1.l_orderkey = t2.l_orderkey

Change-Id: I8047f532f55dc0118f7a843c91275f752c8a190d
Reviewed-on: http://gerrit.cloudera.org:8080/7047
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins

> PhjBuilder::Partition::InsertBatch () continue to make progress even after query cancellation
> ---------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-5419
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5419
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Michael Ho
>            Priority: Critical
>              Labels: bugbash-2017-05-31, fault-tolerance, resource-management
>             Fix For: Impala 2.9.0
>
>
> Canceling a query while remote fragments are in PhjBuilder::Partition::InsertBatch kills the query, but some of the remote fragments continue to spin indefinitely and don't release memory back.
> stack for stuck thread
> {code}
> Thread 1 (process 26668):
> #0  0x00007f8f0a138500 in impala::PhjBuilder::Partition::InsertBatch ()
> #1  0x0000000000d37f72 in impala::PhjBuilder::Partition::BuildHashTable(bool*) ()
> #2  0x0000000000d38d65 in impala::PhjBuilder::BuildHashTablesAndPrepareProbeStreams() ()
> #3  0x0000000000d392a0 in impala::PhjBuilder::FlushFinal(impala::RuntimeState*) ()
> #4  0x0000000000d20447 in impala::Status impala::BlockingJoinNode::SendBuildInputToSink<true>(impala::RuntimeState*, impala::DataSink*) ()
> #5  0x0000000000d1ed5f in impala::BlockingJoinNode::ProcessBuildInputAsync(impala::RuntimeState*, impala::DataSink*, impala::Status*) ()
> #6  0x0000000000bec7d9 in impala::Thread::SuperviseThread(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::function<void ()()>, impala::Promise<long>*) ()
> #7  0x0000000000bed194 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::function<void ()()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void ()()> >, boost::_bi::value<impala::Promise<long>*> > > >::run() ()
> #8  0x0000000000e540ea in ?? ()
> #9  0x000000316da07aa1 in start_thread () from /lib64/libpthread.so.0
> #10 0x000000316d6e893d in clone () from /lib64/libc.so.6
> {code}
> Query
> {code}
> SELECT /* +straight_join */ count(*)
> FROM lineitem
> JOIN /* +broadcast */ supplier ON l_suppkey = s_suppkey
> JOIN /* +broadcast */ part ON l_partkey = p_partkey
> JOIN /* +broadcast */ orders ON l_orderkey = o_orderkey
> WHERE s_name='Supplier#001880004'
>   AND p_brand < 'Brand#30'
>   AND o_orderdate < '1994-01-01'
> {code}
> Snapshot from memz page
> {code}
> Process: Limit=201.73 GB Total=148.77 GB Peak=182.63 GB
>   Free Disk IO Buffers: Total=1.88 GB Peak=1.88 GB
>   RequestPool=root.default: Total=133.39 GB Peak=167.99 GB
>     Query(d84d4f69e2366d60:72d5ad1800000000): Total=133.39 GB Peak=167.99 GB
>       Fragment d84d4f69e2366d60:72d5ad18000001d1: Total=39.18 GB Peak=163.04 GB
>         Runtime Filter Bank: Total=4.01 MB Peak=4.01 MB
>         AGGREGATION_NODE (id=7): Total=40.00 KB Peak=8.51 MB
>           Exprs: Total=4.00 KB Peak=4.00 KB
>         HASH_JOIN_NODE (id=6): Total=39.17 GB Peak=39.17 GB
>           Hash Join Builder (join_node_id=6): Total=39.17 GB Peak=39.17 GB
>         HASH_JOIN_NODE (id=5): Total=50.25 KB Peak=129.25 GB
>           Hash Join Builder (join_node_id=5): Total=13.12 KB Peak=129.24 GB
>         HASH_JOIN_NODE (id=4): Total=42.25 KB Peak=1.05 MB
>           Hash Join Builder (join_node_id=4): Total=13.12 KB Peak=1.01 MB
>         HDFS_SCAN_NODE (id=0): Total=0 Peak=2.24 GB
>         EXCHANGE_NODE (id=8): Total=0 Peak=0
>         DataStreamRecvr: Total=0 Peak=4.01 KB
>         EXCHANGE_NODE (id=9): Total=0 Peak=0
>         DataStreamRecvr: Total=0 Peak=29.98 MB
>         EXCHANGE_NODE (id=10): Total=0 Peak=0
>         DataStreamRecvr: Total=0 Peak=16.39 MB
>         DataStreamSender (dst_id=11): Total=16.00 KB Peak=16.00 KB
>         CodeGen: Total=22.08 KB Peak=3.56 MB
>       Block Manager: Limit=161.39 GB Total=133.38 GB Peak=161.38 GB
>     Query(73440af14cf09128:21f5ed1400000000): Total=31.27 KB Peak=216.54 MB
>       Fragment 73440af14cf09128:21f5ed140000003a: Total=31.27 KB Peak=216.54 MB
>         AGGREGATION_NODE (id=1): Total=8.00 KB Peak=1.89 MB
>           Exprs: Total=4.00 KB Peak=4.00 KB
>         HDFS_SCAN_NODE (id=0): Total=0 Peak=216.45 MB
>         DataStreamSender (dst_id=2): Total=14.22 KB Peak=14.22 KB
>         CodeGen: Total=1.05 KB Peak=188.00 KB
>       Block Manager: Limit=161.39 GB Total=0 Peak=0
>   Untracked Memory: Total=13.50 GB
> {code}
> From the impalad.INFO log
> {code}
> [mmokhtar@va1026 impalad]$ cat impalad.va1026.foo.net.impala.log.INFO.20170531-170204.96499 | grep d84d4f69e2366d60:72d5ad1800000000 | head -n 500
> I0601 00:45:10.883314 96906 admission-controller.cc:442] Schedule for id=d84d4f69e2366d60:72d5ad1800000000 in pool_name=root.default cluster_mem_needed=17179.57 GB PoolConfig: max_requests=100 max_queued=200 max_mem=-1.00 B
> I0601 00:45:10.891855 96906 admission-controller.cc:453] Admitted query id=d84d4f69e2366d60:72d5ad1800000000
> I0601 00:45:10.903158 96906 coordinator.cc:121] Exec() query_id=d84d4f69e2366d60:72d5ad1800000000 stmt=-- Description : Selective broadcast joins between lineitem, supplier, part and orders.
> I0601 00:45:10.935317 96906 coordinator.cc:383] starting execution on 130 backends for query d84d4f69e2366d60:72d5ad1800000000
> I0601 00:45:10.996497 135837 impala-internal-service.cc:44] ExecQueryFInstances(): query_id=d84d4f69e2366d60:72d5ad1800000000
> I0601 00:45:10.996796 135837 query-exec-mgr.cc:46] StartQueryFInstances() query_id=d84d4f69e2366d60:72d5ad1800000000 coord=va1026.foo.net:22000
> I0601 00:45:10.997879 58950 query-state.cc:263] StartFInstances(): query_id=d84d4f69e2366d60:72d5ad1800000000 #instances=1
> I0601 00:45:11.011160 58950 query-state.cc:276] descriptor table for query=d84d4f69e2366d60:72d5ad1800000000
> I0601 00:45:11.012342 58951 query-state.cc:325] Executing instance. instance_id=d84d4f69e2366d60:72d5ad1800000000 fragment_idx=0 per_fragment_instance_idx=0 coord_state_idx=0 #in-flight=1
> I0601 00:45:11.034893 58950 query-exec-mgr.cc:138] ReleaseQueryState(): query_id=d84d4f69e2366d60:72d5ad1800000000 refcnt=3
> I0601 00:45:12.325824 96906 coordinator.cc:396] started execution on 130 backends for query d84d4f69e2366d60:72d5ad1800000000
> I0601 00:45:12.353492 96906 impala-server.cc:926] Query d84d4f69e2366d60:72d5ad1800000000 has timeout of 10m
> I0601 02:42:08.973990 115288 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000
> I0601 02:42:09.172730 11880 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000
> I0601 02:42:09.579278 127412 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000
> I0601 02:42:11.640924 112391 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000
> I0601 02:42:19.061241 115418 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000
> I0601 02:42:19.554117 112209 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000
> I0601 02:42:29.405820 127414 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000
> I0601 02:42:30.752419 145416 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000
> I0601 02:42:36.649454 111018 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000
> I0601 02:42:42.374414 123586 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000
> I0601 02:42:47.766850 143075 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000
> I0601 02:42:56.262635 11526 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000
> I0601 02:43:00.584983 132521 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000
> I0601 02:43:40.918494 132978 coordinator.cc:983] query_id=d84d4f69e2366d60:72d5ad1800000000: first in-progress backend: va1026.foo.net:22000
> I0601 08:34:45.016852 106703 impala-server.cc:937] UnregisterQuery(): query_id=d84d4f69e2366d60:72d5ad1800000000
> I0601 08:34:45.017141 106703 impala-server.cc:1022] Cancel(): query_id=d84d4f69e2366d60:72d5ad1800000000
> I0601 08:34:45.248535 106703 coordinator.cc:920] Cancel() query_id=d84d4f69e2366d60:72d5ad1800000000
> I0601 08:34:45.248984 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1026.foo.net:22000
> I0601 08:34:45.249454 120728 impala-internal-service.cc:63] CancelQueryFInstances(): query_id=d84d4f69e2366d60:72d5ad1800000000
> I0601 08:34:45.249811 120728 query-exec-mgr.cc:83] QueryState: query_id=d84d4f69e2366d60:72d5ad1800000000 refcnt=3
> I0601 08:34:45.250104 120728 query-state.cc:342] Cancel: query_id=d84d4f69e2366d60:72d5ad1800000000
> I0601 08:34:45.250383 120728 data-stream-mgr.cc:258] cancelling all streams for fragment=d84d4f69e2366d60:72d5ad1800000000
> I0601 08:34:45.250669 120728 data-stream-recvr.cc:237] cancelled stream: fragment_instance_id_=d84d4f69e2366d60:72d5ad1800000000 node_id=11
> I0601 08:34:45.250993 120728 query-exec-mgr.cc:138] ReleaseQueryState(): query_id=d84d4f69e2366d60:72d5ad1800000000 refcnt=3
> I0601 08:34:45.251452 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1029.foo.net:22000
> I0601 08:34:45.253317 58951 data-stream-mgr.cc:226] DeregisterRecvr(): fragment_instance_id=d84d4f69e2366d60:72d5ad1800000000, node=11
> I0601 08:34:45.255867 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1030.foo.net:22000
> I0601 08:34:45.266269 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1031.foo.net:22000
> I0601 08:34:45.273082 58951 query-state.cc:332] Instance completed. instance_id=d84d4f69e2366d60:72d5ad1800000000 #in-flight=0 status=CANCELLED: Cancelled
> I0601 08:34:45.273423 58951 query-state.cc:342] Cancel: query_id=d84d4f69e2366d60:72d5ad1800000000
> I0601 08:34:45.273550 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1032.foo.net:22000
> I0601 08:34:45.273679 58951 query-exec-mgr.cc:138] ReleaseQueryState(): query_id=d84d4f69e2366d60:72d5ad1800000000 refcnt=2
> I0601 08:34:45.278386 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1033.foo.net:22000
> I0601 08:34:45.283445 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1034.foo.net:22000
> I0601 08:34:45.291430 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1035.foo.net:22000
> I0601 08:34:45.296653 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1036.foo.net:22000
> I0601 08:34:45.301268 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1037.foo.net:22000
> I0601 08:34:45.306454 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1038.foo.net:22000
> I0601 08:34:45.309927 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1039.foo.net:22000
> I0601 08:34:45.315886 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1041.foo.net:22000
> I0601 08:34:45.321213 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1042.foo.net:22000
> I0601 08:34:45.328903 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1043.foo.net:22000
> I0601 08:34:45.333417 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1045.foo.net:22000
> I0601 08:34:45.338429 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=va1046.foo.net:22000
> I0601 08:34:45.344604 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1501.foo.net:22000
> I0601 08:34:45.348870 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1502.foo.net:22000
> I0601 08:34:45.355267 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1503.foo.net:22000
> I0601 08:34:45.360085 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1504.foo.net:22000
> I0601 08:34:45.365375 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1505.foo.net:22000
> I0601 08:34:45.371361 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1507.foo.net:22000
> I0601 08:34:45.380437 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1508.foo.net:22000
> I0601 08:34:45.384851 106703 coordinator-backend-state.cc:332] sending CancelQueryFInstances rpc for query_id=d84d4f69e2366d60:72d5ad1800000000 backend=vc1509.foo.net:22000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
