impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joe McDonnell (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4623: Enable file handle cache
Date Tue, 09 May 2017 04:23:17 GMT
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4623: Enable file handle cache

Patch Set 4:


Still fixing the test, but here is an update on the rest of it.
File be/src/exec/

Line 894:         runtime_state_->io_mgr()->cached_file_handles_hit_count(reader_context_));
> pretty awkward. why not call the reader directly (and maybe introduce a new
DiskIoRequestContext is an opaque type. disk-io-mgr.h declares it, but the body is defined
in disk-io-mgr-internal.h, and we don't include that anywhere outside of disk-io-mgr*.

It is annoying.
File be/src/exec/hdfs-scan-node-base.h:

Line 441:   RuntimeProfile::Counter* cached_file_handles_hit_count_;
> initialize to nullptr here, that way you don't need to do it in the c'tor (
File be/src/runtime/buffered-block-mgr.h:

Line 25: #include <boost/unordered_map.hpp>
> why?
boost/unordered_map.hpp is needed, but it was being implicitly included by including lru-cache.h
in disk-io-mgr.h. I removed lru-cache.h from disk-io-mgr.h, which broke this.
File be/src/runtime/

Line 342:       io_mgr_->CacheOrCloseFileHandle(file(), exclusive_hdfs_fh_);
> yes, the name really needs to distinguish between cache operations and dire

Line 548:         remote_bytes_ += stats->totalBytesRead - stats->totalLocalBytesRead;
> why not stick that into reader_ as well?
Added a comment in disk-io-mgr.h
File be/src/runtime/

Line 1309:   string key = string(fname) + ":" + std::to_string(mtime);
> you can get rid of dealing with strings here (and doing a bunch of allocati
I use this as the key for the map, so switching the hash alone wouldn't eliminate this. However,
I could hash on the filename and use the filename as the key, then as it is iterating through
file handles for that filename, find one that is free with a matching mtime.

Line 1315:   // Get lock
> that's obvious, no need to paraphrase a single statement (unless there's so

Line 1318:   // Look up in the multimap (find range of values for key)
> same here

Line 1326:     if (!elem->reference_count_) {
> use explicit numeric comparisons, this isn't a bool (although maybe it shou
Changed to bool.

Line 1331:       ImpaladMetrics::IO_MGR_NUM_CACHED_FILE_HANDLES->Increment(-1L);
> hm, i'm not sure we should do that. it's still a cached file handle.
I'm splitting hairs about the exact meaning of cached. 
Every file handle is always in the cache, so IO_MGR_NUM_CACHED_FILE_HANDLES could be the same
as IO_MGR_NUM_FILE_HANDLES_OUTSTANDING. That would be understandable.

The meaning I'm using here is that this is a file handle that is in the cache and not in use.
I'm not sure if that is useful.

Line 1395:     if (p.size_ > p.capacity_) {
> single line
File be/src/runtime/disk-io-mgr.h:

Line 231:   /// Threads checkout a file handle for exclusive access and return it when finished.
> "check out" (checkout is a noun)

Line 237:   template <size_t NUM_PARTITIONS>
> please pull this along with hdfsfilehandle out into a separate file, this f
Moved this out to disk-io-mgr-handle-cache.h/disk-io-mgr-handle-cache.inline.h. disk-io-mgr-handle-cache.h
is included in disk-io-mgr.h. disk-io-mgr-handle-cache.inline.h is only used in

Line 242:     /// first k partitions will each have one additional file handle.
> let's make the size uniform across all partitions (e.g., by rounding up), t

Line 273:        : fh_(fh), reference_count_(0) {}
> single line

Line 276:       int reference_count_;
> why refcount instead of just 'bool in_use' or something like that?
Switched to in_use.

Line 282:     /// the partitions are aligned to cache line boundaries (64 bytes).
> hm, isn't there some predefined constant somewhere for the cache line size?
std::hardware_destructive_interference_size is in C++17. There is a way to query the system
to get it, but I haven't seen a compile time definition.

Line 549:     Status Open(bool use_file_handle_cache) WARN_UNUSED_RESULT;
> does it ever make sense to call open on the same scan range with different 
This happens on the HDFS caching codepath. If we think a scan range is cached, we call Open(false),
but then if caching falls through, we call Close and then reuse the scan range on the normal
codepath, which can call it with either true or false. 

Clarified the comment.

Line 573:     void* meta_data_;
> initialize nullptr here (for all pointer members)

Line 587:     /// Total number of bytes read remotely
> point out why we need to maintain this particular metric instead of in requ
Added some explanation.

Line 601:     /// handle, and no sharing takes place.
> also mention when this is populated and how long it's valid.
Clarified this.

Line 856:   HdfsFileHandle* OpenHdfsFile(const hdfsFS& fs,
> if this always returns a cached handle, would be good to reflect that in th
Changed the names to mirror the cache API, since this is just a wrapper around the cache.

Line 860:   void CacheOrCloseFileHandle(const char* fname, HdfsFileHandle* fid);
> rename

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <>
Gerrit-Reviewer: Dan Hecht <>
Gerrit-Reviewer: Joe McDonnell <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-HasComments: Yes

View raw message