impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Henry Robinson (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
Date Fri, 28 Oct 2016 00:22:13 GMT
Henry Robinson has uploaded a new patch set (#4).

Change subject: IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation
......................................................................

IMPALA-4348 / IMPALA-4333: Improve coordinator fragment cancellation

This patch fixes two problems:

1. If a fragment instance is cancelled (by the coordinator) concurrently
to the coordinator calling GetNext() on its PlanRootSink, GetNext() may
access a destroyed PlanRootSink object and crash (IMPALA-4333).

2. If ExecPlanFragment() times out, but is successful, the coordinator
will never call CloseConsumer() to release its side of the PlanRootSink,
and the fragment instance will never finish (IMPALA-4348).

The following changes address this:

* Make PlanFragmentExecutor::Cancel() call CloseConsumer() on its root
  sink. This ensures that, no matter what the reason, the fragment
  instance will eventually terminate. Otherwise the coordinator may not
  call CloseConsumer() if it could not access the root instance's
  FragmentExecState.

* Ensure that the PlanRootSink has a lifetime at least as long as the
  coordinator object by taking a shared_ptr reference to its parent
  FragmentExecState object. Even if the fragment instance finishes, the
  object will still be valid until the coordinator releases this
  reference. In the future, we expect this to be replaced by an explicit
  reference take / relinquish API.

* Ensure the coordinator tries to cancel fragment instances if any
  attempt was made to start them, not just if the RPC was a
  success. This deals with the timeout case where the RPC was slow, but
  successful.

This patch relies on a fix for IMPALA-4383, which ensures that fragment
instances will always try to send reports, and so will always detect an
error and cancel themselves.

Testing:

* Ran TestRPCTimeout.test_execplanfragment_timeout in a loop for 90
minutes. Patch addresses hangs and failure modes that were observed
previously.

* Added a manual sleep just before Coordinator::GetNext() calls
  root_sink_->GetNext(), and cancelled a query manually in that
  window. Confirmed that query cancels successfully.

Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
---
M be/src/exec/plan-root-sink.cc
M be/src/exec/plan-root-sink.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M be/src/runtime/plan-fragment-executor.cc
5 files changed, 60 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/4865/4
-- 
To view, visit http://gerrit.cloudera.org:8080/4865
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5b4e25c1d658b3929182ba5e56b5c5e881dd394a
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <henry@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Henry Robinson <henry@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>

Mime
View raw message