Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 66E4A200C4D for ; Wed, 5 Apr 2017 22:29:52 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 6561A160B94; Wed, 5 Apr 2017 20:29:52 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 858E7160B76 for ; Wed, 5 Apr 2017 22:29:51 +0200 (CEST) Received: (qmail 66634 invoked by uid 500); 5 Apr 2017 20:29:50 -0000 Mailing-List: contact reviews-help@impala.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list reviews@impala.incubator.apache.org Received: (qmail 66623 invoked by uid 99); 5 Apr 2017 20:29:50 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 05 Apr 2017 20:29:50 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 1D82AC023E for ; Wed, 5 Apr 2017 20:29:50 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 0.363 X-Spam-Level: X-Spam-Status: No, score=0.363 tagged_above=-999 required=6.31 tests=[RDNS_DYNAMIC=0.363, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id VcvTHGPd_JFj for ; Wed, 5 Apr 2017 20:29:48 +0000 (UTC) Received: from ip-10-146-233-104.ec2.internal (ec2-75-101-130-251.compute-1.amazonaws.com [75.101.130.251]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id AE6B15FAE7 for ; Wed, 5 Apr 2017 20:29:48 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by ip-10-146-233-104.ec2.internal (8.14.4/8.14.4) with ESMTP id v35KTmZE014240; Wed, 5 Apr 2017 20:29:48 GMT Message-Id: <201704052029.v35KTmZE014240@ip-10-146-233-104.ec2.internal> Date: Wed, 5 Apr 2017 20:29:48 +0000 From: "Joe McDonnell (Code Review)" To: impala-cr@cloudera.com, reviews@impala.incubator.apache.org CC: Dan Hecht , Marcel Kornacker , Tim Armstrong Reply-To: joemcdonnell@cloudera.com X-Gerrit-MessageType: comment Subject: =?UTF-8?Q?=5BImpala-ASF-CR=5D_IMPALA-4623=3A_Thread_level_file_handle_caching=0A?= X-Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d X-Gerrit-ChangeURL: X-Gerrit-Commit: d9cba91ae1938bd811cb10269642871471cbc3ed In-Reply-To: References: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Content-Disposition: inline User-Agent: Gerrit/2.12.7 archived-at: Wed, 05 Apr 2017 20:29:52 -0000 Joe McDonnell has posted comments on this change. Change subject: IMPALA-4623: Thread level file handle caching ...................................................................... Patch Set 1: (19 comments) http://gerrit.cloudera.org:8080/#/c/6478/1/be/src/runtime/disk-io-mgr.cc File be/src/runtime/disk-io-mgr.cc: Line 70: DEFINE_uint64(max_cached_file_handles, 10000, "Maximum number of HDFS file handles " > on further consideration, esp. in light of what tim brought up (separate re Comment out of date. Line 388: std::vector num_threads_per_disk; > don't use std:: in .cc files Done Line 1050: void DiskIoMgr::WorkLoop(DiskQueue* disk_queue, int thread_file_handle_cache_size) { > you use 'fh' elsewhere for 'file handle', feel free to use that elsewhere i removed Line 1052: if (thread_file_handle_cache_size > 0) { > dcheck that it's not < 0 removed Line 1081: Write(worker_context, static_cast(range)); > what about writing? We currently don't do writes to HDFS via the IO manager. The writes here are to local files. When we are writing tables, we open an Hdfs file handle in hdfs-table-sink.cc. I don't think these file handles can be reused, because we open it in write only mode. The file handle cache opens in read only mode. I don't see any way to convert them. Line 1337: : capacity_(capacity) {} > move to .h removed Line 1339: DiskIoMgr::ThreadFileHandleCache::~ThreadFileHandleCache() {} > remove removed Line 1343: std::string file_key = std::string(fname) + std::to_string(mtime); > is this guaranteed to be unique w/o some sort of separator? removed Line 1351: EvictFileHandle(); > you only call this once, and it's a small function, you might as well inlin removed http://gerrit.cloudera.org:8080/#/c/6478/1/be/src/runtime/disk-io-mgr.h File be/src/runtime/disk-io-mgr.h: Line 233: /// This is a single-threaded LRU cache for Hdfs file handles. The cache creates and > I guess I was thinking that each thread would temporarily check out a file The approach changed to using a single cache at the DiskIoMgr level like the existing cache. Line 235: class ThreadFileHandleCache { > 'thread' is generic. LruFileHandleCache? or simply FileHandleCache? Removed. Line 240: /// Gets an Hdfs file handle for the specified file. It will lookup and use a > "look up" Removed Line 245: HdfsCachedFileHandle* GetFileHandle(const hdfsFS& fs, const char* fname, int64_t mtime); > long line Removed Line 248: > remove blank line Removed Line 256: class LruListElement { > make it a struct, class member vars are supposed to be private. Removed Line 260: ~LruListElement() {} > remove Removed Line 533: /// in the reader context. > "reader context" = reader_ member. referencing the member directly is more N/A in new code Line 544: /// the DiskIoRequestContext. This clears the statistics on this file handle. > why not make that a function of diskiorequestcontext The two statistics that need input from the ScanRange are unexpected_remote_bytes_ and num_remote_ranges_. The unexpected_remote_bytes_ requires knowing the value of expected_local_ for the ScanRange. There is some tracing that prints when a ScanRange has unexpected remote bytes (see ScanRange::Close for where it is now). The tracing is the only thing that would prevent this from being put directly into DiskIoMgrContext if we passed in expected_local_ or the ScanRange to the function. num_remote_ranges_ is impacted by the fact that we can call GetStatistics multiple times per scan range. It needs to know if this range has already been counted at the DiskIoMgrContext level. This can be worked around. Line 574: /// at read time, so hdfs_file_ is null. > is there a need to have it both ways for hdfs files, instead of always open We need to keep the hdfs file open as long as the cached buffer is around so that the hdfs file doesn't get closed. We also use this when file handle caching is off (max_cached_file_handles=0). -- To view, visit http://gerrit.cloudera.org:8080/6478 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes