Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 38382200BC7 for ; Fri, 11 Nov 2016 02:54:03 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 3620A160B10; Fri, 11 Nov 2016 01:54:03 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 5470E160B01 for ; Fri, 11 Nov 2016 02:54:02 +0100 (CET) Received: (qmail 29618 invoked by uid 500); 11 Nov 2016 01:54:01 -0000 Mailing-List: contact reviews-help@impala.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list reviews@impala.incubator.apache.org Received: (qmail 29607 invoked by uid 99); 11 Nov 2016 01:54:01 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 11 Nov 2016 01:54:01 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id C8391C09E8 for ; Fri, 11 Nov 2016 01:54:00 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 0.362 X-Spam-Level: X-Spam-Status: No, score=0.362 tagged_above=-999 required=6.31 tests=[RDNS_DYNAMIC=0.363, SPF_PASS=-0.001] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id CA5OGSz2OnEl for ; Fri, 11 Nov 2016 01:53:59 +0000 (UTC) Received: from ip-10-146-233-104.ec2.internal (ec2-75-101-130-251.compute-1.amazonaws.com [75.101.130.251]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id C86E75FB58 for ; Fri, 11 Nov 2016 01:53:58 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by ip-10-146-233-104.ec2.internal (8.14.4/8.14.4) with ESMTP id uAB1rwXV031469; Fri, 11 Nov 2016 01:53:58 GMT Message-Id: <201611110153.uAB1rwXV031469@ip-10-146-233-104.ec2.internal> Date: Fri, 11 Nov 2016 01:53:58 +0000 From: "Henry Robinson (Code Review)" To: impala-cr@cloudera.com, reviews@impala.incubator.apache.org CC: Sailesh Mukil , Tim Armstrong Reply-To: henry@cloudera.com X-Gerrit-MessageType: comment Subject: =?UTF-8?Q?=5BImpala-ASF-CR=5D_IMPALA-3882=3A_Simplify_some_query_exec_state_locking=0A?= X-Gerrit-Change-Id: I516357d2b5e9eb83e8209872cbe4c078c778a629 X-Gerrit-ChangeURL: X-Gerrit-Commit: dd5d779b4b4c94de987671686af535503f197cf5 In-Reply-To: References: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Content-Disposition: inline User-Agent: Gerrit/2.12.2 archived-at: Fri, 11 Nov 2016 01:54:03 -0000 Henry Robinson has posted comments on this change. Change subject: IMPALA-3882: Simplify some query exec state locking ...................................................................... Patch Set 4: (9 comments) Here's how I think about the lock safety (sorry, long, but hopefully helps to convince): 1. For callers that passed 'false' to GetQES(), nothing changes. That code path stays the same. 2. For callers that passed 'true' to GetQES(), there's a new set of possible executions. Previously no two 'true' callers could obtain a reference to the QES simultaneously. That has now changed. However, apart from registration or UnregisterQuery()*, there's no reason I can think of to care about whether there are other references to the QES; rather the concern is whether the QES is accessed safely under that concurrency. Registration and UnregisterQuery() are safe anyhow. Safety under concurrent references to the same QES is managed by QES::lock(), so the important thing in this patch is making sure that all 'true' callers correctly obtain the lock when they need to. Conveniently, that's the same point at which they all used to *adopt* the same lock, so it's pretty easy to check that we've got that right. I added some more defensive locking even to the 'false' callers when they read from the QES in the latest patch version. 3. Previously, no 'true' caller could obtain a reference to the QES simultaneously with planning. That's changed (effectively by passing 'false' in all those GetQES() calls), and planning doesn't hold the lock now anyhow. * One might think that UnregisterQuery() erased the QES from the map while holding both locks, so that 'true' callers couldn't get at the QES after it was removed - guaranteeing that all 'true' callers were finished with the QES - but that was not the case even before this patch. I ran this patch for ~10k queries under the stress test. I'll leave it running. http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/impala-beeswax-server.cc File be/src/service/impala-beeswax-server.cc: Line 279: lock_guard l(*exec_state->lock()); > There still is a race between GetQES() and this lock(). What if it the quer I think that race exists currently, and is benign. If you look at https://github.com/apache/incubator-impala/blob/master/be/src/service/impala-server.cc#L904, UnregisterQuery() doesn't do anything to make sure that other references to the exec state have gone away, so it's totally possible for some other caller to have an existing shared_ptr and to read from the QES while it calls Done(). So the question becomes: is it safe to do anything to a QES if Done() might be called concurrently? Depends on the definition of 'safe': * safe == free from concurrent access bugs -> both caller and Done() need to take lock_, which they [should] do. * safe == free from *logic* bugs, where Done() destroys state that caller wants to read -> that's not what Done() does. Mostly Done() fills in some query profile information, and tells the coordinator it can stop executing (and the coordinator doesn't get destroyed otherwise). http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: PS4, Line 625: .get() > I don't feel that strongly, but comparisons with NULL for scoped_ptr/unique I don't feel that strongly either, but went with this because a) it's consistent with other usages in this file and b) it's explicit, rather than relying on operator overloading. Line 641: return; > Unneeded return. Done PS4, Line 708: shared_ptr exec_state = GetQueryExecState(query_id); > If this line is moved above L701, we wouldn't need to call GetQueryExecStat Done http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/impala-server.cc File be/src/service/impala-server.cc: Line 789: RETURN_IF_ERROR((*exec_state)->ExecQuery()); > What happens if Cancel() was called before we reach here? Do you mean this comment to be on line 785? Cancel() could happen before this because the query is 'visible' after RegisterQuery() returns, and anyone that looks at e.g. the debug web page can find out the query id. But there is a new window for cancellation between RegisterQuery() and PlanQuery(). PlanQuery() has code to deal with that. Let me know if I'm not seeing the same race that you are. http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/impala-server.h File be/src/service/impala-server.h: Line 82: /// 2. session_state_map_lock_ > Curious - do we have a similar problem with the session map lock? Similar-ish, but not as acute. We don't have the problem where we have this weird lock-inversion like we do when calling GetQES(..., true). GetSessionState(). But there is one place where we get s_s_m_l_ and then the s_s_l_, and other places where we don't always get the s_s_m_l_ first. This can lead to deadlock if the latter case then calls GetSessionState() with the session ID for the session it already has the lock for. I think it would be a good idea to move s_s_m_l_ to the bottom of this hierarchy as well by ensuring that it should never be taken in conjunction with any locks. Any iteration over the set of sessions can be done by copying out the shared_ptrs, releasing the lock and then iterating again. http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/query-exec-state.cc File be/src/service/query-exec-state.cc: PS4, Line 586: Like the other Exec(), fill out as much profile information as we're able to. > Update comment. Done PS4, Line 594: SetPlanningFinished(); > Why call SetPlanningFinished() here? Because this is effectively the 'planning' path for metadata ops, some of which have result sets and therefore metadata as well. I changed the method name to SetMetadataAvailable() to be clearer. I'm in two minds about whether it's a good change or not. Line 620: block_on_wait_cv_.wait(l); > Should we switch to promises instead of cond vars here? I think that's a good idea (see the jira for this patch), but it's a separate change to simplify the logic a bit. -- To view, visit http://gerrit.cloudera.org:8080/4935 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I516357d2b5e9eb83e8209872cbe4c078c778a629 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Henry Robinson Gerrit-Reviewer: Henry Robinson Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes