Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 59748200C79 for ; Fri, 19 May 2017 23:30:10 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 563BF160BD1; Fri, 19 May 2017 21:30:10 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 73F17160BB0 for ; Fri, 19 May 2017 23:30:09 +0200 (CEST) Received: (qmail 50411 invoked by uid 500); 19 May 2017 21:30:08 -0000 Mailing-List: contact reviews-help@impala.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list reviews@impala.incubator.apache.org Received: (qmail 50395 invoked by uid 99); 19 May 2017 21:30:08 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 19 May 2017 21:30:08 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id F1FE21803A3 for ; Fri, 19 May 2017 21:30:07 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 0.363 X-Spam-Level: X-Spam-Status: No, score=0.363 tagged_above=-999 required=6.31 tests=[RDNS_DYNAMIC=0.363, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id y6ULD1qnn1r1 for ; Fri, 19 May 2017 21:30:06 +0000 (UTC) Received: from ip-10-146-233-104.ec2.internal (ec2-75-101-130-251.compute-1.amazonaws.com [75.101.130.251]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 7BDA25F20C for ; Fri, 19 May 2017 21:30:06 +0000 (UTC) Received: from localhost (localhost [127.0.0.1]) by ip-10-146-233-104.ec2.internal (8.14.4/8.14.4) with ESMTP id v4JLU5qt007521; Fri, 19 May 2017 21:30:05 GMT Message-Id: <201705192130.v4JLU5qt007521@ip-10-146-233-104.ec2.internal> Date: Fri, 19 May 2017 21:30:05 +0000 From: "Joe McDonnell (Code Review)" To: impala-cr@cloudera.com, reviews@impala.incubator.apache.org CC: Dan Hecht , Marcel Kornacker , Tim Armstrong Reply-To: joemcdonnell@cloudera.com X-Gerrit-MessageType: comment Subject: =?UTF-8?Q?=5BImpala-ASF-CR=5D_IMPALA-4623=3A_Enable_file_handle_cache=0A?= X-Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d X-Gerrit-ChangeURL: X-Gerrit-Commit: 40b0cb4f27324712cef043c3065d60e969cf82dd In-Reply-To: References: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Content-Disposition: inline User-Agent: Gerrit/2.12.7 archived-at: Fri, 19 May 2017 21:30:10 -0000 Joe McDonnell has posted comments on this change. Change subject: IMPALA-4623: Enable file handle cache ...................................................................... Patch Set 8: (15 comments) http://gerrit.cloudera.org:8080/#/c/6478/8/be/src/runtime/disk-io-mgr-handle-cache.h File be/src/runtime/disk-io-mgr-handle-cache.h: Line 88: /// with a call to ReleaseFileHandle to release exclusive control. > this should also evict handles if a more recent mtime for a file with exist Added a TODO for this. I plan on dealing with this when doing eviction by timeout. Here is a scenario that I might be concerned about for immediate eviction when seeing a higher mtime: 1. ScanRange #1 starts up 2. ScanRange #1 does a read, putting a file handle with mtime=5 in the cache. 3. The file is modified and now has an mtime=6. 4. ScanRange #2 starts up 5. ScanRange #2 does a read, putting a file handle with mtime=6 in the cache and removing the mtime=5 file handle. 6. ScanRange #1 wants to do another read and goes to get a file handle from the cache, but mtime=5 file handles aren't available. What happens now? It could return an error, which makes some sense. It could open a new file handle, but it would be looking at the mtime=6 file. The mtime=6 file might have different offsets, so it might be reading garbage. A lot of this is hypothetical, but it would be nice for a ScanRange to deal with a single version of a file. That is a property that the old code had. Line 98: static const uint32_t HASH_SEED = 0x87654321; > where from? No need for a non-zero hash seed for our purpose, so I have removed this and substituted zero in the code. http://gerrit.cloudera.org:8080/#/c/6478/8/be/src/runtime/disk-io-mgr-handle-cache.inline.h File be/src/runtime/disk-io-mgr-handle-cache.inline.h: Line 65: // Keep going until find unused entry with the same mtime > fix grammar Done Line 113: // to its place in the map > add it? I started working on this. The templating starts to infect everything. I will keep looking into it for a followup, but I think I will leave this as a TODO for the first version. Line 117: if (elem->fh.get() == fh) { > single line Done Line 129: // File can be unbuffered > where does the corresponding buffering take place? Added information in the comment. The buffering here is readahead buffering that Hdfs does for a read. That is what we are unbuffering. http://gerrit.cloudera.org:8080/#/c/6478/8/be/src/runtime/disk-io-mgr-scan-range.cc File be/src/runtime/disk-io-mgr-scan-range.cc: Line 417: int last_read = -1; > what does this mean? does it count anything, and if so, what? This is the number of bytes read by either hdfsPread or hdfsRead. In case of an error, it returns -1. I changed the name to "last_bytes_read" Line 429: return Status(GetHdfsErrorMsg("Error seeking HDFS file: ", file_)); > also print position Done Line 435: return Status(GetHdfsErrorMsg("Error reading from HDFS file: ", file_)); > leave todo that you need to retry once to deal with stale cache entries Done Line 549: remote_bytes_ += stats->totalBytesRead - stats->totalLocalBytesRead; > why can't remote_bytes_ go into reader_? The only real obstacle is this piece of logging: https://github.com/apache/incubator-impala/blob/master/be/src/runtime/disk-io-mgr-scan-range.cc#L341 (It is in a different location after this code change.) The logging is printing the number of unexpected remote bytes for this scan range. If the remote bytes go the reader_, I don't have a way to separate it out. If we don't need the logging, then the state on the ScanRange can be a boolean, and we can send remote bytes directly to reader_. http://gerrit.cloudera.org:8080/#/c/6478/8/be/src/runtime/disk-io-mgr.cc File be/src/runtime/disk-io-mgr.cc: Line 1300 > this comment disappeared. Added a comment in the new location. Hdfs can do some readahead buffering, so we want to reduce this buffering when putting a file handle in the cache. Line 278: cached_read_options_(nullptr), > move into .h file and remove here Done http://gerrit.cloudera.org:8080/#/c/6478/8/be/src/runtime/disk-io-mgr.h File be/src/runtime/disk-io-mgr.h: Line 498: int64_t remote_bytes_; > num_... Done http://gerrit.cloudera.org:8080/#/c/6478/8/tests/query_test/test_hdfs_fd_caching.py File tests/query_test/test_hdfs_fd_caching.py: Line 74: self.execute_query_expect_success(self.client, count_query) > this should give some expected result Done Line 99: self.execute_query_expect_success(self.client, count_query) > you should check for the expected result, depending on the modification typ Done -- To view, visit http://gerrit.cloudera.org:8080/6478 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes