impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marcel Kornacker (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-4623: Thread level file handle caching
Date Tue, 28 Mar 2017 00:30:48 GMT
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4623: Thread level file handle caching

Patch Set 1:

File be/src/runtime/

Line 70: DEFINE_uint64(max_cached_file_handles, 10000, "Maximum number of HDFS file handles
on further consideration, esp. in light of what tim brought up (separate remote io queue),
we should have separate cache capacities for cached reads, disk queues, and both remote io

let's retire this flag (it's safe to remove) and introduce new flags that are more descriptive

Line 388:   std::vector<int> num_threads_per_disk;
don't use std:: in .cc files

Line 1050: void DiskIoMgr::WorkLoop(DiskQueue* disk_queue, int thread_file_handle_cache_size)
you use 'fh' elsewhere for 'file handle', feel free to use that elsewhere in order to end
up with more manageable variable names.

Line 1052:   if (thread_file_handle_cache_size > 0) {
dcheck that it's not < 0

Line 1081:       Write(worker_context, static_cast<WriteRange*>(range));
what about writing?

Line 1337:   : capacity_(capacity) {}
move to .h

Line 1339: DiskIoMgr::ThreadFileHandleCache::~ThreadFileHandleCache() {}

Line 1343:   std::string file_key = std::string(fname) + std::to_string(mtime);
is this guaranteed to be unique w/o some sort of separator?

Line 1351:       EvictFileHandle();
you only call this once, and it's a small function, you might as well inline it.
File be/src/runtime/disk-io-mgr.h:

Line 233:   /// This is a single-threaded LRU cache for Hdfs file handles. The cache creates
> High-level observation: my understanding is that this changes the upper bou
each thread needs its own file handle in order to get accurate io stats.

Line 235:   class ThreadFileHandleCache {
'thread' is generic. LruFileHandleCache? or simply FileHandleCache?

Line 240:     /// Gets an Hdfs file handle for the specified file. It will lookup and use
"look up"

Line 245:     HdfsCachedFileHandle* GetFileHandle(const hdfsFS& fs, const char* fname,
int64_t mtime);
long line

Line 248: 
remove blank line

Line 256:     class LruListElement {
make it a struct, class member vars are supposed to be private.

Line 260:       ~LruListElement() {}

Line 533:     /// in the reader context.
"reader context" = reader_ member. referencing the member directly is more explicit.

Line 544:     /// the DiskIoRequestContext. This clears the statistics on this file handle.
why not make that a function of diskiorequestcontext

Line 564:     int64_t remote_bytes_;
why here and not diskiorequestcontext?

Line 574:     /// at read time, so hdfs_file_ is null.
is there a need to have it both ways for hdfs files, instead of always opening at read time,
which would make the control flow easier to follow? it shouldn't matter for hdfs caching reads,
they get the entire scan range in one go anyway.

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-HasComments: Yes

View raw message