Date: Mon, 3 Apr 2017 15:35:41 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: dev@flume.apache.org
Subject: [jira] [Commented] (FLUME-3080) Close failure in HDFS Sink might cause data loss

    [ https://issues.apache.org/jira/browse/FLUME-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953676#comment-15953676 ]

ASF GitHub Bot commented on FLUME-3080:
---------------------------------------

GitHub user adenes opened a pull request:

    https://github.com/apache/flume/pull/127

    FLUME-3080. Call DistributedFileSystem.recoverLease() if close() fails to avoid lease leak

    If the HDFS Sink tries to close a file but it fails (e.g. due to timeout), the last block might not end up in the COMPLETE state. In this case block recovery should happen, but as the lease is still held by Flume, the NameNode will start the recovery process only after the hard limit of 1 hour expires. This change adds an explicit recoverLease() call in case of close failure.

    For more details see https://issues.apache.org/jira/browse/FLUME-3080
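For illustration, a minimal sketch of that approach (assumed class and variable names, not the actual change in this pull request): if {{close()}} fails and the underlying {{FileSystem}} is a {{DistributedFileSystem}}, ask the NameNode to recover the lease before re-throwing the failure.

{noformat}
// Sketch only: recover the lease explicitly when close() fails, so the
// NameNode can start block recovery right away instead of waiting for the
// 1 hour hard limit to expire. Names (fs, filePath, out) are placeholders.
import java.io.Closeable;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoveryOnCloseFailure {

  static void closeOrRecoverLease(FileSystem fs, Path filePath, Closeable out)
      throws IOException {
    try {
      out.close();
    } catch (IOException closeFailure) {
      // close() failed (e.g. timed out): the last block may stay in
      // UNDER_CONSTRUCTION state while this client still holds the lease.
      if (fs instanceof DistributedFileSystem) {
        try {
          // Triggers lease (and block) recovery on the NameNode immediately.
          ((DistributedFileSystem) fs).recoverLease(filePath);
        } catch (IOException recoverFailure) {
          // Best effort: do not mask the original close failure.
        }
      }
      throw closeFailure;
    }
  }
}
{noformat}

Note that {{recoverLease()}} returns {{true}} only if the file is already closed; otherwise it just starts the recovery, which then completes asynchronously on the NameNode and DataNodes.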
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adenes/flume FLUME-3080

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flume/pull/127.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #127

----
commit 8cc9082c69ad0aea2e8cfa20e906261a6a417245
Author: Denes Arvay
Date:   2017-04-03T15:27:19Z

    FLUME-3080. call DistributedFileSystem.recoverLease() if close() fails to avoid lease leak

    If the HDFS Sink tries to close a file but it fails (e.g. due to timeout), the last block might not end up in the COMPLETE state. In this case block recovery should happen, but as the lease is still held by Flume, the NameNode will start the recovery process only after the hard limit of 1 hour expires. This change adds an explicit recoverLease() call in case of close failure.
----


> Close failure in HDFS Sink might cause data loss
> ------------------------------------------------
>
>             Key: FLUME-3080
>             URL: https://issues.apache.org/jira/browse/FLUME-3080
>         Project: Flume
>      Issue Type: Bug
>      Components: Sinks+Sources
> Affects Versions: 1.7.0
>        Reporter: Denes Arvay
>        Assignee: Denes Arvay
>        Priority: Blocker
>
> If the HDFS Sink tries to close a file but it fails (e.g. due to timeout), the last block might not end up in the COMPLETE state. In this case block recovery should happen, but as the lease is still held by Flume, the NameNode will start the recovery process only after the hard limit of 1 hour expires.
> The lease recovery can be started manually by the {{hdfs debug recoverLease}} command.
> For reproduction I removed the close call from the {{BucketWriter}} (https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java#L380) to simulate the failure and used the following config:
> {noformat}
> agent.sources = testsrc
> agent.sinks = testsink
> agent.channels = testch
> agent.sources.testsrc.type = netcat
> agent.sources.testsrc.bind = localhost
> agent.sources.testsrc.port = 9494
> agent.sources.testsrc.channels = testch
> agent.sinks.testsink.type = hdfs
> agent.sinks.testsink.hdfs.path = /user/flume/test
> agent.sinks.testsink.hdfs.rollInterval = 0
> agent.sinks.testsink.hdfs.rollCount = 3
> agent.sinks.testsink.serializer = avro_event
> agent.sinks.testsink.serializer.compressionCodec = snappy
> agent.sinks.testsink.hdfs.fileSuffix = .avro
> agent.sinks.testsink.hdfs.fileType = DataStream
> agent.sinks.testsink.hdfs.batchSize = 2
> agent.sinks.testsink.hdfs.writeFormat = Text
> agent.sinks.testsink.hdfs.idleTimeout=20
> agent.sinks.testsink.channel = testch
> agent.channels.testch.type = memory
> {noformat}
> After ingesting 6 events ("a" - "f") 2 files were created on HDFS, as expected.
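The six test events above can be fed to the netcat source with any TCP client; a minimal, hypothetical helper (names assumed, not part of the issue) that sends them to {{localhost:9494}}, one newline-terminated line per event, might look like this:

{noformat}
// Hypothetical reproduction helper: sends the six test events "a" .. "f"
// to the netcat source configured above.
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SendTestEvents {
  public static void main(String[] args) throws Exception {
    try (Socket socket = new Socket("localhost", 9494);
         Writer out = new OutputStreamWriter(
             socket.getOutputStream(), StandardCharsets.UTF_8)) {
      for (char c = 'a'; c <= 'f'; c++) {
        // The netcat source turns each newline-terminated line into one event.
        out.write(c + "\n");
      }
      out.flush();
    }
  }
}
{noformat}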
> But there are missing events when listing the contents in Spark shell:
> {noformat}
> scala> sqlContext.read.avro("/user/flume/test/FlumeData.14908867127*.avro").collect().map(a => new String(a(1).asInstanceOf[Array[Byte]])).foreach(println)
> a
> b
> d
> {noformat}
> {{hdfs fsck}} also confirms that the blocks are still in {{UNDER_CONSTRUCTION}} state:
> {noformat}
> $ hdfs fsck /user/flume/test/ -openforwrite -files -blocks
> FSCK started by root (auth:SIMPLE) from /172.31.114.3 for path /user/flume/test/ at Thu Mar 30 08:23:56 PDT 2017
> /user/flume/test/
> /user/flume/test/FlumeData.1490887185312.avro 310 bytes, 1 block(s), OPENFORWRITE: MISSING 1 blocks of total size 310 B
> 0. BP-1285398861-172.31.114.3-1489845696835:blk_1073761923_21128{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-e0d04bef-a861-40b0-99dd-27bfb2871ecd:NORMAL:172.31.114.27:20002|RBW], ReplicaUnderConstruction[[DISK]DS-d1979e0c-db81-4790-b225-ae8a4cf42dd8:NORMAL:172.31.114.32:20002|RBW], ReplicaUnderConstruction[[DISK]DS-ca00550d-702e-4892-a54a-7105af0c19ee:NORMAL:172.31.114.24:20002|RBW]]} len=310 MISSING!
> /user/flume/test/FlumeData.1490887185313.avro 292 bytes, 1 block(s), OPENFORWRITE: MISSING 1 blocks of total size 292 B
> 0. BP-1285398861-172.31.114.3-1489845696835:blk_1073761924_21129{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ca00550d-702e-4892-a54a-7105af0c19ee:NORMAL:172.31.114.24:20002|RBW], ReplicaUnderConstruction[[DISK]DS-e0d04bef-a861-40b0-99dd-27bfb2871ecd:NORMAL:172.31.114.27:20002|RBW], ReplicaUnderConstruction[[DISK]DS-d1979e0c-db81-4790-b225-ae8a4cf42dd8:NORMAL:172.31.114.32:20002|RBW]]} len=292 MISSING!
> {noformat}
> These blocks need to be recovered by starting a lease recovery process on the NameNode (which will then run the block recovery as well). This can be triggered programmatically via the DFSClient.
> Adding this call if the close fails solves the issue.
> cc [~jojochuang]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)