Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 3635D200C63 for ; Thu, 11 May 2017 22:22:44 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 34D83160BC7; Thu, 11 May 2017 20:22:44 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 7CC67160BB3 for ; Thu, 11 May 2017 22:22:43 +0200 (CEST) Received: (qmail 32070 invoked by uid 500); 11 May 2017 20:22:42 -0000 Mailing-List: contact reviews-help@impala.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list reviews@impala.incubator.apache.org Received: (qmail 32059 invoked by uid 99); 11 May 2017 20:22:42 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 11 May 2017 20:22:42 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id D5C151A0247 for ; Thu, 11 May 2017 20:22:41 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 0.362 X-Spam-Level: X-Spam-Status: No, score=0.362 tagged_above=-999 required=6.31 tests=[RDNS_DYNAMIC=0.363, SPF_PASS=-0.001] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id UYFRESguz8uT for ; Thu, 11 May 2017 20:22:39 +0000 (UTC) Received: from ip-10-146-233-104.ec2.internal (ec2-75-101-130-251.compute-1.amazonaws.com [75.101.130.251]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id C234C5F642 for ; Thu, 11 May 2017 20:22:38 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by ip-10-146-233-104.ec2.internal (8.14.4/8.14.4) with ESMTP id v4BKMbR5008287; Thu, 11 May 2017 20:22:37 GMT Message-Id: <201705112022.v4BKMbR5008287@ip-10-146-233-104.ec2.internal> Date: Thu, 11 May 2017 20:22:36 +0000 From: "Joe McDonnell (Code Review)" To: impala-cr@cloudera.com, reviews@impala.incubator.apache.org CC: Marcel Kornacker , Dan Hecht , Tim Armstrong Reply-To: joemcdonnell@cloudera.com X-Gerrit-MessageType: newpatchset Subject: =?UTF-8?Q?=5BImpala-ASF-CR=5D_IMPALA-4623=3A_Enable_file_handle_cache=0A?= X-Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d X-Gerrit-ChangeURL: X-Gerrit-Commit: 6809ae53ad3c7967eac0bed11b364c072bfd285c In-Reply-To: References: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Content-Disposition: inline User-Agent: Gerrit/2.12.7 archived-at: Thu, 11 May 2017 20:22:44 -0000 Joe McDonnell has uploaded a new patch set (#7). Change subject: IMPALA-4623: Enable file handle cache ...................................................................... IMPALA-4623: Enable file handle cache Currently, every scan range maintains a file handle, even when multiple scan ranges are accessing the same file. Opening the file handles causes load on the NameNode, which can lead to scaling issues. There are two parts to this transaction: 1. Enable file handle caching by default 2. Share the file handle between scan ranges from the same file The scan range no longer maintains its own Hdfs file handle. On each read, the io thread will get the Hdfs file handle from the cache (opening it if necessary) and use that for the read. This allows multiple scan ranges on the same file to use the same file handle. Since the file offsets are no longer consistent for an individual scan range, all Hdfs reads need to either use hdfsPread or do a seek before reading. Additionally, since Hdfs read statistics are maintained on the file handle, the read statistics must be retrieved and cleared after each read. To manage contention, the file handle cache is now partitioned by a hash of the key into independent caches with independent locks. The allowed capacity of the file handle cache is split evenly among the partitions. File handles are evicted independently for each partition. The file handle cache maintains ownership of the file handles at all times, but it will not evict a file handle that is in use. If max_cached_file_handles is set to 0 or the the scan range is accessing data cached by Hdfs, the scan range will get a file handle from the cache and hold it until the scan range is closed. This mimics the existing behavior, except the file handle stays in the cache and is owned by the cache. Since it is in use, it will not be evicted. Tests: test_hdfs_fd_caching.py copies the files from an existing table into a new directory and uses that to create an external table. It queries the external table, then uses the hdfs commandline to manipulate the hdfs file (delete, move, etc). It queries again to make sure we don't crash. Then, it runs "invalidate metadata". It does not check the content of the rows returned. TODO: 1. Determine appropriate defaults. 2. Other tests a. File overwrite b. Any others? 3. For scan ranages that use Hdfs caching, should there be some sharing at the scanner level? Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/runtime/buffered-block-mgr.h A be/src/runtime/disk-io-mgr-handle-cache.h A be/src/runtime/disk-io-mgr-handle-cache.inline.h M be/src/runtime/disk-io-mgr-internal.h M be/src/runtime/disk-io-mgr-reader-context.cc M be/src/runtime/disk-io-mgr-scan-range.cc M be/src/runtime/disk-io-mgr.cc M be/src/runtime/disk-io-mgr.h M tests/metadata/test_refresh_partition.py M tests/query_test/test_hdfs_fd_caching.py 12 files changed, 702 insertions(+), 374 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/6478/7 -- To view, visit http://gerrit.cloudera.org:8080/6478 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong