Date: Mon, 3 Apr 2017 15:35:41 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: dev@flume.apache.org
Subject: [jira] [Commented] (FLUME-3080) Close failure in HDFS Sink might cause data loss

    [ https://issues.apache.org/jira/browse/FLUME-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953676#comment-15953676 ]

ASF GitHub Bot commented on FLUME-3080:
---------------------------------------

GitHub user adenes opened a pull request:

    https://github.com/apache/flume/pull/127

    FLUME-3080. Call DistributedFileSystem.recoverLease() if close() fails to avoid lease leak

    If the HDFS Sink tries to close a file but it fails (e.g. due to timeout), the last block might not end up in the COMPLETE state. In this case block recovery should happen, but as the lease is still held by Flume, the NameNode will start the recovery process only after the hard limit of 1 hour expires. This change adds an explicit recoverLease() call in case of close failure.

    For more details see https://issues.apache.org/jira/browse/FLUME-3080
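For illustration, a minimal sketch of that approach (assumed class and variable names, not the actual change in this pull request): if {{close()}} fails and the underlying {{FileSystem}} is a {{DistributedFileSystem}}, ask the NameNode to recover the lease before re-throwing the failure.

{noformat}
// Sketch only: recover the lease explicitly when close() fails, so the
// NameNode can start block recovery right away instead of waiting for the
// 1 hour hard limit to expire. Names (fs, filePath, out) are placeholders.
import java.io.Closeable;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoveryOnCloseFailure {

  static void closeOrRecoverLease(FileSystem fs, Path filePath, Closeable out)
      throws IOException {
    try {
      out.close();
    } catch (IOException closeFailure) {
      // close() failed (e.g. timed out): the last block may stay in
      // UNDER_CONSTRUCTION state while this client still holds the lease.
      if (fs instanceof DistributedFileSystem) {
        try {
          // Triggers lease (and block) recovery on the NameNode immediately.
          ((DistributedFileSystem) fs).recoverLease(filePath);
        } catch (IOException recoverFailure) {
          // Best effort: do not mask the original close failure.
        }
      }
      throw closeFailure;
    }
  }
}
{noformat}

Note that {{recoverLease()}} returns {{true}} only if the file is already closed; otherwise it just starts the recovery, which then completes asynchronously on the NameNode and DataNodes.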
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adenes/flume FLUME-3080

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flume/pull/127.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #127

----
commit 8cc9082c69ad0aea2e8cfa20e906261a6a417245
Author: Denes Arvay
Date:   2017-04-03T15:27:19Z

    FLUME-3080. call DistributedFileSystem.recoverLease() if close() fails to avoid lease leak

    If the HDFS Sink tries to close a file but it fails (e.g. due to timeout), the last block might not end up in the COMPLETE state. In this case block recovery should happen, but as the lease is still held by Flume, the NameNode will start the recovery process only after the hard limit of 1 hour expires. This change adds an explicit recoverLease() call in case of close failure.
----


> Close failure in HDFS Sink might cause data loss
> ------------------------------------------------
>
>             Key: FLUME-3080
>             URL: https://issues.apache.org/jira/browse/FLUME-3080
>         Project: Flume
>      Issue Type: Bug
>      Components: Sinks+Sources
> Affects Versions: 1.7.0
>        Reporter: Denes Arvay
>        Assignee: Denes Arvay
>        Priority: Blocker
>
> If the HDFS Sink tries to close a file but it fails (e.g. due to timeout), the last block might not end up in the COMPLETE state. In this case block recovery should happen, but as the lease is still held by Flume, the NameNode will start the recovery process only after the hard limit of 1 hour expires.
> The lease recovery can be started manually by the {{hdfs debug recoverLease}} command.
> For reproduction I removed the close call from the {{BucketWriter}} (https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java#L380) to simulate the failure and used the following config:
> {noformat}
> agent.sources = testsrc
> agent.sinks = testsink
> agent.channels = testch
> agent.sources.testsrc.type = netcat
> agent.sources.testsrc.bind = localhost
> agent.sources.testsrc.port = 9494
> agent.sources.testsrc.channels = testch
> agent.sinks.testsink.type = hdfs
> agent.sinks.testsink.hdfs.path = /user/flume/test
> agent.sinks.testsink.hdfs.rollInterval = 0
> agent.sinks.testsink.hdfs.rollCount = 3
> agent.sinks.testsink.serializer = avro_event
> agent.sinks.testsink.serializer.compressionCodec = snappy
> agent.sinks.testsink.hdfs.fileSuffix = .avro
> agent.sinks.testsink.hdfs.fileType = DataStream
> agent.sinks.testsink.hdfs.batchSize = 2
> agent.sinks.testsink.hdfs.writeFormat = Text
> agent.sinks.testsink.hdfs.idleTimeout=20
> agent.sinks.testsink.channel = testch
> agent.channels.testch.type = memory
> {noformat}
> After ingesting 6 events ("a" - "f") 2 files were created on HDFS, as expected.
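The six test events above can be fed to the netcat source with any TCP client; a minimal, hypothetical helper (names assumed, not part of the issue) that sends them to {{localhost:9494}}, one newline-terminated line per event, might look like this:

{noformat}
// Hypothetical reproduction helper: sends the six test events "a" .. "f"
// to the netcat source configured above.
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SendTestEvents {
  public static void main(String[] args) throws Exception {
    try (Socket socket = new Socket("localhost", 9494);
         Writer out = new OutputStreamWriter(
             socket.getOutputStream(), StandardCharsets.UTF_8)) {
      for (char c = 'a'; c <= 'f'; c++) {
        // The netcat source turns each newline-terminated line into one event.
        out.write(c + "\n");
      }
      out.flush();
    }
  }
}
{noformat}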
> But there are missing events when listing the contents in Spark shell:
> {noformat}
> scala> sqlContext.read.avro("/user/flume/test/FlumeData.14908867127*.avro").collect().map(a => new String(a(1).asInstanceOf[Array[Byte]])).foreach(println)
> a
> b
> d
> {noformat}
> {{hdfs fsck}} also confirms that the blocks are still in {{UNDER_CONSTRUCTION}} state:
> {noformat}
> $ hdfs fsck /user/flume/test/ -openforwrite -files -blocks
> FSCK started by root (auth:SIMPLE) from /172.31.114.3 for path /user/flume/test/ at Thu Mar 30 08:23:56 PDT 2017
> /user/flume/test/
> /user/flume/test/FlumeData.1490887185312.avro 310 bytes, 1 block(s), OPENFORWRITE: MISSING 1 blocks of total size 310 B
> 0. BP-1285398861-172.31.114.3-1489845696835:blk_1073761923_21128{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-e0d04bef-a861-40b0-99dd-27bfb2871ecd:NORMAL:172.31.114.27:20002|RBW], ReplicaUnderConstruction[[DISK]DS-d1979e0c-db81-4790-b225-ae8a4cf42dd8:NORMAL:172.31.114.32:20002|RBW], ReplicaUnderConstruction[[DISK]DS-ca00550d-702e-4892-a54a-7105af0c19ee:NORMAL:172.31.114.24:20002|RBW]]} len=310 MISSING!
> /user/flume/test/FlumeData.1490887185313.avro 292 bytes, 1 block(s), OPENFORWRITE: MISSING 1 blocks of total size 292 B
> 0. BP-1285398861-172.31.114.3-1489845696835:blk_1073761924_21129{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ca00550d-702e-4892-a54a-7105af0c19ee:NORMAL:172.31.114.24:20002|RBW], ReplicaUnderConstruction[[DISK]DS-e0d04bef-a861-40b0-99dd-27bfb2871ecd:NORMAL:172.31.114.27:20002|RBW], ReplicaUnderConstruction[[DISK]DS-d1979e0c-db81-4790-b225-ae8a4cf42dd8:NORMAL:172.31.114.32:20002|RBW]]} len=292 MISSING!
> {noformat}
> These blocks need to be recovered by starting a lease recovery process on the NameNode (which will then run the block recovery as well). This can be triggered programmatically via the DFSClient.
> Adding this call if the close fails solves the issue.
> cc [~jojochuang]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)