Date: Fri, 8 Mar 2019 15:34:00 +0000 (UTC)
From: "Steve Loughran (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-15999) S3Guard: Better support for out-of-band operations

    [ https://issues.apache.org/jira/browse/HADOOP-15999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787977#comment-16787977 ]

Steve Loughran commented on HADOOP-15999:
-----------------------------------------

This is ~ready to go in; I've only got one change to the code (see the bottom). I do think we need to be sure that we've covered every opportunity for inconsistencies to arise, and I'm now considering performance too.

h3. Deletions

I think the whole of S3Guard is potentially brittle to

* OOB deletions: you skip these here, so we're no worse off, but because the S3AInputStream retries on FNFE so as to "debounce" cached 404s, it's potentially going to retry forever.
* OOB creation of a file which has a deletion tombstone marker (sketched below).
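To make that second case concrete, here is a rough sketch of the sequence. This is illustrative only, not code from the patch: guardedFs and rawFs are assumed to be two S3AFileSystem instances bound to the same bucket, one running with a MetadataStore and one with S3Guard disabled.

{code}
// delete through the guarded FS (which writes a tombstone), recreate the file
// out of band through the raw FS, then read back through the guarded FS.
// (uses org.apache.hadoop.fs.Path, FSDataInputStream and
// org.apache.hadoop.fs.contract.ContractTestUtils)
Path path = new Path("/test/oob-tombstone");
ContractTestUtils.createFile(guardedFs, path, true, "v1".getBytes());
guardedFs.delete(path, false);                 // tombstone lands in the MetadataStore
ContractTestUtils.createFile(rawFs, path, true, "v2".getBytes());  // OOB recreate: S3 only
try (FSDataInputStream in = guardedFs.open(path)) {
  in.read();  // expect FNFE here (or at open()), possibly only after the 404-debounce retries
}
{code}

Whether the FNFE surfaces at open() or on the first read depends on where the tombstone check happens; that's exactly what the proposed test below should tell us.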
You are already documenting this, so the next step to think about is code:

*Proposed*: write a test to simulate that deletion problem, to see what happens. I'm actually curious now. We ought to have the S3AInputStream retry briefly on that initial GET failing, but only on that initial one (after setting "fs.s3a.retry.limit" to something low and the interval down to 10ms or so, to fail fast).

Sequences:
{code}
1. create; delete; open; read -> fail after retry
2. create; open; read; delete; read -> fail fast on second read
{code}

The filesystem's IGNORED_ERRORS statistic in its StoreStatistics is incremented on every ignored error, so it will have increased after sequence 1, whereas after sequence 2 it will not have. If either of these tests doesn't fail quite as expected, we can disable it and continue; at least we'd now have some tests simulating a condition we don't have a fix for.

*Proposed*: add a JIRA on this for us all to worry about. For both we just need some model of how long it takes for the debouncing to stabilise. Then, in this new check, if an FNFE is raised *and* the check is happening after (modtime + debounce-delay), it's a real FNFE.

h3. Timestamp ordering

I'm going to add a new complication here. When you initiate a PUT, AFAIK (and [~Thomas Demoor] should be able to confirm), the modified time is the time the PUT began, not when the PUT completed. Which means I can have a workflow of

{code}
write1 = fs.create(path, true)
write2 = fs.create(path, true)
write2.close()
status1 = fs.getFileStatus(path)
write1.write(new byte[128 * 1024 * 1024])   // 128MB of data
write1.close()
status2 = fs.getFileStatus(path)
assertTrue(status2.getModificationTime() < status1.getModificationTime())
{code}

There's no way we are going to be able to defend against that except by tracking versions in the DDB tables, with the S3AFileStatus including that version when known. What we'll have to do then is make sure that this issue is documented today, and that the extension to do tag tracking in S3Guard keeps an eye on versions.

*Proposed*: mention this problem in the docs. Once version tracking goes into S3Guard, we'll need to move the ("is newer than") operator out of this modtime check into somewhere else (proposed: do it in the S3AFileStatus, which will look at version info if set, falling back to etags). (Actually, if version checking is on in the GET, we'd never see the updated file, would we?)

h3. Performance impact

This is going to reinstate the HEAD on every read, making non-auth S3Guard a bit slower. We could think about addressing that by moving the checks into the input stream itself; that is, the first GET which returns data would also act as the metadata check. That'd mean the read context will need updating with some "metastoreProcessHeader" callback to invoke on the first GET.

*Proposed*: add a JIRA for this to become an optimization.

The good news is that because it's reading a file, it's only one HTTP HEAD request: no need for either of the other two directory probes except in the case where the file isn't there.

h2. Code review

h3. ITestS3GuardOutOfBandOperations

Check your import ordering: new files are where we should start off with getting things "correct" according to our style rules.
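For reference, the grouping most hadoop-aws files follow is roughly: java/javax imports first, then third-party, then org.apache, with static imports last, each block separated by a blank line. The names below are only illustrative, not the actual imports of the new test:

{code}
import java.io.FileNotFoundException;
import java.util.concurrent.TimeUnit;

import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.S3AFileSystem;

import static org.apache.hadoop.fs.contract.ContractTestUtils.createFile;
{code}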
> S3Guard: Better support for out-of-band operations
> --------------------------------------------------
>
>                 Key: HADOOP-15999
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15999
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.1.0
>            Reporter: Sean Mackrory
>            Assignee: Gabor Bota
>            Priority: Major
>         Attachments: HADOOP-15999-007.patch, HADOOP-15999.001.patch, HADOOP-15999.002.patch, HADOOP-15999.003.patch, HADOOP-15999.004.patch, HADOOP-15999.005.patch, HADOOP-15999.006.patch, HADOOP-15999.008.patch, out-of-band-operations.patch
>
> S3Guard was initially done on the premise that a new MetadataStore would be the source of truth, and that it wouldn't provide guarantees if updates were done without using S3Guard.
> I've been seeing increased demand for better support for scenarios where operations are done on the data that can't reasonably be done with S3Guard involved. For example:
> * A file is deleted using S3Guard, and replaced by some other tool. S3Guard can't tell the difference between the new file and delete / list inconsistency and continues to treat the file as deleted.
> * An S3Guard-ed file is overwritten by a longer file by some other tool. When reading the file, only the length of the original file is read.
> We could possibly have smarter behavior here by querying both S3 and the MetadataStore (even in cases where we may currently only query the MetadataStore in getFileStatus) and use whichever one has the higher modified time.
> This kills the performance boost we currently get in some workloads with the short-circuited getFileStatus, but we could keep it with authoritative mode which should give a larger performance boost. At least we'd get more correctness without authoritative mode and a clear declaration of when we can make the assumptions required to short-circuit the process. If we can't consider S3Guard the source of truth, we need to defer to S3 more.
> We'd need to be extra sure of any locality / time zone issues if we start relying on mod_time more directly, but currently we're tracking the modification time as returned by S3 anyway.