impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Henry Robinson (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3882: Simplify some query exec state locking
Date Fri, 11 Nov 2016 01:53:58 GMT
Henry Robinson has posted comments on this change.

Change subject: IMPALA-3882: Simplify some query exec state locking
......................................................................


Patch Set 4:

(9 comments)

Here's how I think about the lock safety (sorry, long, but hopefully helps to convince):

1. For callers that passed 'false' to GetQES(), nothing changes. That code path stays the
same. 

2. For callers that passed 'true' to GetQES(), there's a new set of possible executions. Previously
no two 'true' callers could obtain a reference to the QES simultaneously. That has now changed.


However, apart from registration or UnregisterQuery()*, there's no reason I can think of to
care about whether there are other references to the QES; rather the concern is whether the
QES is accessed safely under that concurrency. Registration and UnregisterQuery() are safe
anyhow.

Safety under concurrent references to the same QES is managed by QES::lock(), so the important
thing in this patch is making sure that all 'true' callers correctly obtain the lock when
they need to. Conveniently, that's the same point at which they all used to *adopt* the same
lock, so it's pretty easy to check that we've got that right.

I added some more defensive locking even to the 'false' callers when they read from the QES
in the latest patch version. 

3. Previously, no 'true' caller could obtain a reference to the QES simultaneously with planning.
That's changed (effectively by passing 'false' in all those GetQES() calls), and planning
doesn't hold the lock now anyhow.

* One might think that UnregisterQuery() erased the QES from the map while holding both locks,
so that 'true' callers couldn't get at the QES after it was removed - guaranteeing that all
'true' callers were finished with the QES - but that was not the case even before this patch.

I ran this patch for ~10k queries under the stress test. I'll leave it running.

http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/impala-beeswax-server.cc
File be/src/service/impala-beeswax-server.cc:

Line 279:   lock_guard<mutex> l(*exec_state->lock());
> There still is a race between GetQES() and this lock(). What if it the quer
I think that race exists currently, and is benign. If you look at https://github.com/apache/incubator-impala/blob/master/be/src/service/impala-server.cc#L904,
UnregisterQuery() doesn't do anything to make sure that other references to the exec state
have gone away, so it's totally possible for some other caller to have an existing shared_ptr
and to read from the QES while it calls Done().

So the question becomes: is it safe to do anything to a QES if Done() might be called concurrently?
Depends on the definition of 'safe':

* safe == free from concurrent access bugs -> both caller and Done() need to take lock_,
which they [should] do.
* safe == free from *logic* bugs, where Done() destroys state that caller wants to read ->
that's not what Done() does. Mostly Done() fills in some query profile information, and tells
the coordinator it can stop executing (and the coordinator doesn't get destroyed otherwise).


http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/impala-hs2-server.cc
File be/src/service/impala-hs2-server.cc:

PS4, Line 625: .get()
> I don't feel that strongly, but comparisons with NULL for scoped_ptr/unique
I don't feel that strongly either, but went with this because a) it's consistent with other
usages in this file and b) it's explicit, rather than relying on operator overloading.


Line 641:   return;
> Unneeded return.
Done


PS4, Line 708: shared_ptr<QueryExecState> exec_state = GetQueryExecState(query_id);
> If this line is moved above L701, we wouldn't need to call GetQueryExecStat
Done


http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 789:   RETURN_IF_ERROR((*exec_state)->ExecQuery());
> What happens if Cancel() was called before we reach here?
Do you mean this comment to be on line 785? Cancel() could happen before this because the
query is 'visible' after RegisterQuery() returns, and anyone that looks at e.g. the debug
web page can find out the query id. 

But there is a new window for cancellation between RegisterQuery() and PlanQuery(). PlanQuery()
has code to deal with that. Let me know if I'm not seeing the same race that you are.


http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/impala-server.h
File be/src/service/impala-server.h:

Line 82: /// 2. session_state_map_lock_
> Curious - do we have a similar problem with the session map lock?
Similar-ish, but not as acute. We don't have the problem where we have this weird lock-inversion
like we do when calling GetQES(..., true). GetSessionState(). But there is one place where
we get s_s_m_l_ and then the s_s_l_, and other places where we don't always get the s_s_m_l_
first. 

This can lead to deadlock if the latter case then calls GetSessionState() with the session
ID for the session it already has the lock for.

I think it would be a good idea to move s_s_m_l_ to the bottom of this hierarchy as well by
ensuring that it should never be taken in conjunction with any locks. Any iteration over the
set of sessions can be done by copying out the shared_ptrs, releasing the lock and then iterating
again.


http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/query-exec-state.cc
File be/src/service/query-exec-state.cc:

PS4, Line 586: Like the other Exec(), fill out as much profile information as we're able to.
> Update comment.
Done


PS4, Line 594: SetPlanningFinished();
> Why call SetPlanningFinished() here?
Because this is effectively the 'planning' path for metadata ops, some of which have result
sets and therefore metadata as well. 

I changed the method name to SetMetadataAvailable() to be clearer. I'm in two minds about
whether it's a good change or not.


Line 620:       block_on_wait_cv_.wait(l);
> Should we switch to promises instead of cond vars here?
I think that's a good idea (see the jira for this patch), but it's a separate change to simplify
the logic a bit.


-- 
To view, visit http://gerrit.cloudera.org:8080/4935
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I516357d2b5e9eb83e8209872cbe4c078c778a629
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <henry@cloudera.com>
Gerrit-Reviewer: Henry Robinson <henry@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sailesh@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message