impala-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Volker (Code Review)" <ger...@cloudera.org>
Subject [Impala-CR](cdh5-trunk) IMPALA-2979: Fix scheduling on remote hosts
Date Fri, 03 Jun 2016 15:03:25 GMT
Lars Volker has posted comments on this change.

Change subject: IMPALA-2979: Fix scheduling on remote hosts
......................................................................


Patch Set 22:

(40 comments)

Thanks for the review. Please see PS23.

http://gerrit.cloudera.org:8080/#/c/2200/22/be/src/scheduling/simple-scheduler.cc
File be/src/scheduling/simple-scheduler.cc:

Line 236:   for(const BackendList::value_type& backend: backends) {
> for (
Done


Line 271:       for(const TTopicItem& item: delta.topic_entries) {
> formatting (and elsewhere)
Done


Line 312:       // TODO: Currently each non-empty statestore update triggers a rebuild of
these
> collect bigger todos at the bottom of the class comment in the .h
Done


Line 320:     // TODO: Inject global dependencies to make this code testable.
> same here
Done


Line 465:     lock_guard<mutex> l(backend_map_lock_);
> if you're trying to assign 100k scan ranges, this is still going to cause c
Done


Line 553:       AssignmentHeap::handle_type handle = node_assignment.backend_heap.push(new_info);
> you always update backend_heap and backend_handles together (and if you don
Done


Line 1058:   // Reset the random backend ranks and global assignment state.
> use actual var names in comments
Removed the method.


Line 1067:     AssignmentHeap::handle_type handle = global_assignment_.backend_heap.push(new_info);
> might as well use an inlined c'tor here
Done


Line 1078:     AssignmentInfo* node_assignment) {
> either change to context or to query_node_assignment, to differentiate more
Done


Line 1079:   // This method is called with a list of local backend candidates, so they also
must be
> also dcheck(!data_locations.empty())
Done


Line 1080:   // contained in backend_handles and thus in the global_assignment_ heap.
> what is backend_handles? what is the list of local backend candidates?
Done


Line 1081:   vector<size_t> candidates;
> what are these?
Added a comment.


Line 1084:   for (size_t i = 0; i < data_locations.size(); ++i) {
> why not just int?
Some compilers warn here because of the different types (size() returns size_t). Ours doesn't
changed it to int.


Line 1089:       const BackendAssignmentInfo& heap_data = *handle_it->second;
> rename variable to reflect its meaning
I combined both lines into one.


Line 1102:     // candidates in-place.
> more specifically ", and remove all others from 'candidates'" (instead of '
Done


Line 1103:     int64_t min_last_assigend = numeric_limits<int64_t>::max();
> spelling
Done


Line 1108:       const BackendAssignmentInfo& heap_data =
> meaningful name
Done


Line 1109:           *(global_assignment_.backend_handles[backend_ip]);
> who locks this?
Done


Line 1139:     AssignmentInfo* node_assignment) {
> change to query_assignment to differentiate from global_assignment_
Now changed to assignment_ctx. The whole method is much simpler now.


Line 1148:     // implies that the node_assignment heap must contain all of the handles, because
> all of which handles?
Done


Line 1149:     // backends in the global heap are ordered by the time of their last assignment.
If
> well, primarily by the number of assigned bytes
Done


Line 1153:     // list in between scheduling runs for multiple scan ranges of the same plan
node.
> i'm not following.
Done


Line 1158:     // scheduling of multiple scan ranges of the same plan node.
> does this get handled for global_assignment_?
Done


http://gerrit.cloudera.org:8080/#/c/2200/22/be/src/scheduling/simple-scheduler.h
File be/src/scheduling/simple-scheduler.h:

Line 106:     int64_t last_assignment_timestamp;
> comment that this is a logical timestamp, derived from num_assignments_
Done


Line 123:   typedef boost::heap::binomial_heap<BackendAssignmentInfo,
> does this really require anything from boost (and not std)? we're trying to
Yes, we need the heap to be addressable to perform a decrease key operation on arbitrary element
in the heap. std::priority_queue doesn't have a way to do this, and using a vector + std::make_heap
is  less efficient O(n) vs O(logn) over the number of backends. Of course it might be viable
for the numbers we expect, let me know if you want me to give it a try.


Line 141:   /// Total number of assignments made during the lifetime of the scheduler.
> clarify what you mean by assignment (per query, per plan node, per scan ran
Done


Line 210:   /// data from the given data_loation.
> what is data_location?
Yes, I updated the comment.


Line 216:   /// node_assignments:     Used to track per-query and per-host assignment information.
> which backends does this track? is this an output param? the global state?
Updated the comment.


Line 254:   Status ComputeScanRangeAssignmentForQuery(const TQueryExecRequest& exec_request,
> isn't the 'for query' implied?
Done


Line 262:   /// First we assemble a list of candidate backends, from which the scan range
can be
> no ,
Done


Line 263:   /// read. Backends are classified by their memory distance to the data, and the
closest
> briefly explain memory distance, it's not a standard term
Done


Line 266:   /// if they were further away. Only backends running on one of the replicas of
the scan
> i don't think you should describe the steps of the algorithm here (you coul
Done


Line 273:   /// list, resulting in a local read. First a search is performed to find the candidate
> yep, this is way too detailed.
Done


Line 307:   Status ComputeScanRangeAssignmentForPlanNode(PlanNodeId node_id,
> same here, the parameters differentiate it from 'for query'.
Done


Line 350:   /// Lookup the IP address of 'hostname' and return whether the lookup was successful.
If
> lookup is a noun, to look up is the verb.
Done


Line 356:   /// Resolve a host to one of its IP addresses by calling getaddrinfo() and sorting
the
> comment on semantics of result, not how it's implemented
Done


Line 358:   Status HostnameToIpAddr(const Hostname& hostname, IpAddr* ip);
> on the face of it this looks like LookUpBackendIp. how is it different?
One looks at known backends only, the other on tries to resolve via the OS. I added to both
comments.


Line 371:   /// minimum number of assigned bytes. If backends have been assigned equal amounts
of
> i guess this is "with the minimum number of assigned bytes (according to 'c
Done


Line 376:       bool global_backend_order, AssignmentInfo* context);
> is this the global assignment info?
Added to the comment.


Line 378:   /// Among backend hosts collocated with 'data_locations', select the one with
the
> collocated?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2200
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I044f83806fcde820fcb38047cf6b8e780d803858
Gerrit-PatchSet: 22
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: anujphadke <aphadke@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message