impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thomas Tauber-Marshall (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5749: coordinator race hits DCHECK 'num remaining backends > 0'
Date Fri, 04 Aug 2017 18:14:29 GMT
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
......................................................................


Patch Set 2:

(1 comment)

> > Does this trigger only when there are two concurrent calls to
 > > UpdateBackendExecStatus() from the same backend? If so, do we
 > > understand why that happens so often?
 > 
 > My understanding is this:
 > A fragment instance sends reports every 'n' seconds. Due to a
 > congested network, two of these reports for the same fragment
 > instance from a backend can arrive at the coordinator and start
 > being processed at around the same time, hence leading to this
 > issue.
 > 
 > Ideally a second report cannot be send until the first one is ACKd
 > by the coordinator, since a lock is held until the report is ACKd,
 > in the ReportProfileThread(); but there is only one case where a
 > second report will be sent before the first one is responded to,
 > i.e.  from FragmentInstanceState::Finalize().
 > 
 > So ReportProfileThread() sends the one report of the last
 > finstance, then Finalize() sends the second report of the same
 > finstance before the first one is responded to.

This may be possible, but the query I've been using to repro it runs fast enough that there
are never any updates sent by the report thread, just from Finalize, so I haven't observed
it.

The reason I've been seeing it fail is: there are two different fragments from the same backend
that send reports at the same time. When they both reach Coordinator::UpdateBackendExecStatus(),
they both get past the IsDone() check.

Then ApplyExecStatusReport() gets called twice but both calls return true in the out parameter
'done', because the first call contained a fragment with an error status which causes the
backend to be considered done, so both threads end up decreasing num_remaining_backends_ for
the same backend.

http://gerrit.cloudera.org:8080/#/c/7577/1/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

Line 75:   /// this update was not applied because this backend has previously completed.
This can
> Please also add:
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Henry Robinson <henry@cloudera.com>
Gerrit-Reviewer: Michael Ho <kwho@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sailesh@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message