impala-reviews mailing list archives

From "Joe McDonnell (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-4623: Thread level file handle caching
Date Sat, 25 Mar 2017 00:40:01 GMT
Joe McDonnell has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6478

Change subject: IMPALA-4623: Thread level file handle caching
......................................................................

IMPALA-4623: Thread level file handle caching

Currently, every scan range maintains a file handle, even
when multiple scan ranges are accessing the same file.
Opening these file handles puts load on the NameNode,
which can lead to scaling issues.

There are two parts to this change:
1. Enable file handle caching by default
2. Introduce a thread file handle cache to share file
handles between scan ranges

For thread file handle caching, the scan range no longer
maintains its own Hdfs file handle. On each read, the io
thread will get the Hdfs file handle from its cache
(opening it if necessary) and use that for the read.
This allows multiple scan ranges on the same file to
use the same file handle. Since a shared handle's file
position is no longer private to an individual scan range,
all Hdfs reads are now done with hdfsPread, which takes an
explicit offset. Additionally, since Hdfs read statistics
are maintained on the file handle, the read statistics must
be retrieved and cleared after each read.
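
The per-thread lookup and positional read described above can be
sketched roughly as follows. This is an illustration only, not the
code in this patch: ThreadFileHandleCache, GetHandle and
ReadScanRange are hypothetical names, POSIX open()/pread() stand in
for the Hdfs calls, the LRU eviction policy is an assumption, and
the read-statistics step is left out.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cache of open descriptors keyed by path. One instance is
// meant to be owned by a single io thread, so it does no locking.
// POSIX open()/pread() stand in for the Hdfs calls; LRU eviction is an
// assumption, not something the change description specifies.
class ThreadFileHandleCache {
 public:
  explicit ThreadFileHandleCache(std::size_t capacity)
      : capacity_(capacity == 0 ? 1 : capacity) {}

  ~ThreadFileHandleCache() {
    for (auto& entry : lru_) close(entry.second);
  }

  ThreadFileHandleCache(const ThreadFileHandleCache&) = delete;
  ThreadFileHandleCache& operator=(const ThreadFileHandleCache&) = delete;

  // Returns a descriptor for 'path', opening the file only on a cache miss.
  // Returns -1 if the file cannot be opened.
  int GetHandle(const std::string& path) {
    auto it = index_.find(path);
    if (it != index_.end()) {
      // Cache hit: move the entry to the front (most recently used).
      lru_.splice(lru_.begin(), lru_, it->second);
      return it->second->second;
    }
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return -1;
    ++opens_;
    if (lru_.size() >= capacity_) {
      // Cache full: close and drop the least recently used handle.
      close(lru_.back().second);
      index_.erase(lru_.back().first);
      lru_.pop_back();
    }
    lru_.emplace_front(path, fd);
    index_[path] = lru_.begin();
    return fd;
  }

  // Number of times a file was actually opened (cache misses).
  std::size_t opens() const { return opens_; }

 private:
  using Entry = std::pair<std::string, int>;
  std::size_t capacity_;
  std::size_t opens_ = 0;
  std::list<Entry> lru_;  // front = most recently used
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};

// Reads up to 'len' bytes at an explicit offset. Because the handle may be
// shared by several scan ranges, the read never relies on the handle's own
// file position.
ssize_t ReadScanRange(ThreadFileHandleCache& cache, const std::string& path,
                      off_t offset, char* buf, std::size_t len) {
  assert(buf != nullptr);
  int fd = cache.GetHandle(path);
  if (fd < 0) return -1;
  return pread(fd, buf, len, offset);
}
```

In this sketch, two scan ranges that read different offsets of the
same file through one cache trigger a single open, which is the
effect the change is after.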

Thread file handle caching is not used for local non-Hdfs
files.

Reads for scan ranges whose data is cached by Hdfs
are done in the scanner threads and do not use thread
file handle caching. Instead, they use the existing
global file handle cache and maintain a file handle
per scan range as before.

When Impala starts up with max_cached_file_handles=N,
the global cache is given 50% of the allowed file handles.
The other 50% is split evenly between all io threads.
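
As a rough illustration of that split (not code from this patch):
HandleBudget and SplitHandleBudget are hypothetical names, and
rounding down with integer division is an assumption, since the
description does not say how remainders are handled.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: how many cached handles each cache may hold.
struct HandleBudget {
  std::size_t global_cache;   // handles for the global file handle cache
  std::size_t per_io_thread;  // handles for each io thread's cache
};

// Gives half of the allowed handles to the global cache and divides the
// rest evenly between io threads. Remainders are rounded down (assumption).
inline HandleBudget SplitHandleBudget(std::size_t max_cached_file_handles,
                                      std::size_t num_io_threads) {
  assert(num_io_threads > 0);
  std::size_t global = max_cached_file_handles / 2;
  std::size_t for_threads = max_cached_file_handles - global;
  return HandleBudget{global, for_threads / num_io_threads};
}
```

For example, with max_cached_file_handles=1000 and 8 io threads,
this sketch gives the global cache 500 handles and each io thread
62.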

TODO:
1. Determine appropriate defaults.
2. Maintain appropriate metrics.
3. Write tests.
4. For scan ranges that use Hdfs caching, should there
be some sharing at the scanner level?

Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d
---
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/disk-io-mgr.h
3 files changed, 216 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/6478/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6478
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <joemcdonnell@cloudera.com>
