impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Volker (Code Review)" <>
Subject [Impala-CR](cdh5-trunk) IMPALA-2979: Fix scheduling on remote hosts
Date Fri, 03 Jun 2016 15:03:25 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-2979: Fix scheduling on remote hosts

Patch Set 22:


Thanks for the review. Please see PS23.
File be/src/scheduling/

Line 236:   for(const BackendList::value_type& backend: backends) {
> for (

Line 271:       for(const TTopicItem& item: delta.topic_entries) {
> formatting (and elsewhere)

Line 312:       // TODO: Currently each non-empty statestore update triggers a rebuild of
> collect bigger todos at the bottom of the class comment in the .h

Line 320:     // TODO: Inject global dependencies to make this code testable.
> same here

Line 465:     lock_guard<mutex> l(backend_map_lock_);
> if you're trying to assign 100k scan ranges, this is still going to cause c

Line 553:       AssignmentHeap::handle_type handle = node_assignment.backend_heap.push(new_info);
> you always update backend_heap and backend_handles together (and if you don

Line 1058:   // Reset the random backend ranks and global assignment state.
> use actual var names in comments
Removed the method.

Line 1067:     AssignmentHeap::handle_type handle = global_assignment_.backend_heap.push(new_info);
> might as well use an inlined c'tor here

Line 1078:     AssignmentInfo* node_assignment) {
> either change to context or to query_node_assignment, to differentiate more

Line 1079:   // This method is called with a list of local backend candidates, so they also
must be
> also dcheck(!data_locations.empty())

Line 1080:   // contained in backend_handles and thus in the global_assignment_ heap.
> what is backend_handles? what is the list of local backend candidates?

Line 1081:   vector<size_t> candidates;
> what are these?
Added a comment.

Line 1084:   for (size_t i = 0; i < data_locations.size(); ++i) {
> why not just int?
Some compilers warn here because of the different types (size() returns size_t). Ours doesn't
changed it to int.

Line 1089:       const BackendAssignmentInfo& heap_data = *handle_it->second;
> rename variable to reflect its meaning
I combined both lines into one.

Line 1102:     // candidates in-place.
> more specifically ", and remove all others from 'candidates'" (instead of '

Line 1103:     int64_t min_last_assigend = numeric_limits<int64_t>::max();
> spelling

Line 1108:       const BackendAssignmentInfo& heap_data =
> meaningful name

Line 1109:           *(global_assignment_.backend_handles[backend_ip]);
> who locks this?

Line 1139:     AssignmentInfo* node_assignment) {
> change to query_assignment to differentiate from global_assignment_
Now changed to assignment_ctx. The whole method is much simpler now.

Line 1148:     // implies that the node_assignment heap must contain all of the handles, because
> all of which handles?

Line 1149:     // backends in the global heap are ordered by the time of their last assignment.
> well, primarily by the number of assigned bytes

Line 1153:     // list in between scheduling runs for multiple scan ranges of the same plan
> i'm not following.

Line 1158:     // scheduling of multiple scan ranges of the same plan node.
> does this get handled for global_assignment_?
File be/src/scheduling/simple-scheduler.h:

Line 106:     int64_t last_assignment_timestamp;
> comment that this is a logical timestamp, derived from num_assignments_

Line 123:   typedef boost::heap::binomial_heap<BackendAssignmentInfo,
> does this really require anything from boost (and not std)? we're trying to
Yes, we need the heap to be addressable to perform a decrease key operation on arbitrary element
in the heap. std::priority_queue doesn't have a way to do this, and using a vector + std::make_heap
is  less efficient O(n) vs O(logn) over the number of backends. Of course it might be viable
for the numbers we expect, let me know if you want me to give it a try.

Line 141:   /// Total number of assignments made during the lifetime of the scheduler.
> clarify what you mean by assignment (per query, per plan node, per scan ran

Line 210:   /// data from the given data_loation.
> what is data_location?
Yes, I updated the comment.

Line 216:   /// node_assignments:     Used to track per-query and per-host assignment information.
> which backends does this track? is this an output param? the global state?
Updated the comment.

Line 254:   Status ComputeScanRangeAssignmentForQuery(const TQueryExecRequest& exec_request,
> isn't the 'for query' implied?

Line 262:   /// First we assemble a list of candidate backends, from which the scan range
can be
> no ,

Line 263:   /// read. Backends are classified by their memory distance to the data, and the
> briefly explain memory distance, it's not a standard term

Line 266:   /// if they were further away. Only backends running on one of the replicas of
the scan
> i don't think you should describe the steps of the algorithm here (you coul

Line 273:   /// list, resulting in a local read. First a search is performed to find the candidate
> yep, this is way too detailed.

Line 307:   Status ComputeScanRangeAssignmentForPlanNode(PlanNodeId node_id,
> same here, the parameters differentiate it from 'for query'.

Line 350:   /// Lookup the IP address of 'hostname' and return whether the lookup was successful.
> lookup is a noun, to look up is the verb.

Line 356:   /// Resolve a host to one of its IP addresses by calling getaddrinfo() and sorting
> comment on semantics of result, not how it's implemented

Line 358:   Status HostnameToIpAddr(const Hostname& hostname, IpAddr* ip);
> on the face of it this looks like LookUpBackendIp. how is it different?
One looks at known backends only, the other on tries to resolve via the OS. I added to both

Line 371:   /// minimum number of assigned bytes. If backends have been assigned equal amounts
> i guess this is "with the minimum number of assigned bytes (according to 'c

Line 376:       bool global_backend_order, AssignmentInfo* context);
> is this the global assignment info?
Added to the comment.

Line 378:   /// Among backend hosts collocated with 'data_locations', select the one with
> collocated?

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I044f83806fcde820fcb38047cf6b8e780d803858
Gerrit-PatchSet: 22
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Lars Volker <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-Reviewer: anujphadke <>
Gerrit-HasComments: Yes

View raw message