flume-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLUME-3080) Close failure in HDFS Sink might cause data loss
Date Mon, 03 Apr 2017 15:35:41 GMT

    [ https://issues.apache.org/jira/browse/FLUME-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953676#comment-15953676
] 

ASF GitHub Bot commented on FLUME-3080:
---------------------------------------

GitHub user adenes opened a pull request:

    https://github.com/apache/flume/pull/127

    FLUME-3080. Call DistributedFileSystem.recoverLease() if close() fails to avoid lease
leak

    If the HDFS Sink tries to close a file but it fails (e.g. due to timeout) the last block
might not end up in COMPLETE state. In this case block recovery should happen but as the lease
is still held by Flume the NameNode will start the recovery process only after the hard limit
of 1 hour expires.
    
    This change adds an explicit recoverLease() call in case of close failure.
    
    For more details see https://issues.apache.org/jira/browse/FLUME-3080

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adenes/flume FLUME-3080

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flume/pull/127.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #127
    
----
commit 8cc9082c69ad0aea2e8cfa20e906261a6a417245
Author: Denes Arvay <denes@cloudera.com>
Date:   2017-04-03T15:27:19Z

    FLUME-3080. call DistributedFileSystem.recoverLease() if close() fails to avoid lease
leak
    
    If the HDFS Sink tries to close a file but it fails (e.g. due to timeout) the last block
might
    not end up in COMPLETE state. In this case block recovery should happen but as the lease
is
    still held by Flume the NameNode will start the recovery process only after the hard limit
of
    1 hour expires.
    
    This change adds an explicit recoverLease() call in case of close failure.

----


> Close failure in HDFS Sink might cause data loss
> ------------------------------------------------
>
>                 Key: FLUME-3080
>                 URL: https://issues.apache.org/jira/browse/FLUME-3080
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: 1.7.0
>            Reporter: Denes Arvay
>            Assignee: Denes Arvay
>            Priority: Blocker
>
> If the HDFS Sink tries to close a file but it fails (e.g. due to timeout) the last block
might not end up in COMPLETE state. In this case block recovery should happen but as the lease
is still held by Flume the NameNode will start the recovery process only after the hard limit
of 1 hour expires.
> The lease recovery can be started manually by the {{hdfs debug recoverLease}} command.
> For reproduction I removed the close call from the {{BucketWriter}} (https://github.com/apache/flume/blob/trunk/flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java#L380)
to simulate the failure and used the following config:
> {noformat}
> agent.sources = testsrc
> agent.sinks = testsink
> agent.channels = testch
> agent.sources.testsrc.type = netcat
> agent.sources.testsrc.bind = localhost
> agent.sources.testsrc.port = 9494
> agent.sources.testsrc.channels = testch
> agent.sinks.testsink.type = hdfs
> agent.sinks.testsink.hdfs.path = /user/flume/test
> agent.sinks.testsink.hdfs.rollInterval = 0
> agent.sinks.testsink.hdfs.rollCount = 3
> agent.sinks.testsink.serializer = avro_event
> agent.sinks.testsink.serializer.compressionCodec = snappy
> agent.sinks.testsink.hdfs.fileSuffix = .avro
> agent.sinks.testsink.hdfs.fileType = DataStream
> agent.sinks.testsink.hdfs.batchSize = 2
> agent.sinks.testsink.hdfs.writeFormat = Text
> agent.sinks.testsink.hdfs.idleTimeout=20
> agent.sinks.testsink.channel = testch
> agent.channels.testch.type = memory
> {noformat}
> After ingesting 6 events ("a" - "f") 2 files were created on HDFS, as expected. But there
are missing events when listing the contents in Spark shell:
> {noformat}
> scala> sqlContext.read.avro("/user/flume/test/FlumeData.14908867127*.avro").collect().map(a
=> new String(a(1).asInstanceOf[Array[Byte]])).foreach(println)
> a
> b
> d
> {noformat}
> {{hdfs fsck}} also confirms that the blocks are still in {{UNDER_CONSTRUCTION}} state:
> {noformat}
> $ hdfs fsck /user/flume/test/ -openforwrite -files -blocks
> FSCK started by root (auth:SIMPLE) from /172.31.114.3 for path /user/flume/test/ at Thu
Mar 30 08:23:56 PDT 2017
> /user/flume/test/ <dir>
> /user/flume/test/FlumeData.1490887185312.avro 310 bytes, 1 block(s), OPENFORWRITE:  MISSING
1 blocks of total size 310 B
> 0. BP-1285398861-172.31.114.3-1489845696835:blk_1073761923_21128{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-e0d04bef-a861-40b0-99dd-27bfb2871ecd:NORMAL:172.31.114.27:20002|RBW],
ReplicaUnderConstruction[[DISK]DS-d1979e0c-db81-4790-b225-ae8a4cf42dd8:NORMAL:172.31.114.32:20002|RBW],
ReplicaUnderConstruction[[DISK]DS-ca00550d-702e-4892-a54a-7105af0c19ee:NORMAL:172.31.114.24:20002|RBW]]}
len=310 MISSING!
> /user/flume/test/FlumeData.1490887185313.avro 292 bytes, 1 block(s), OPENFORWRITE:  MISSING
1 blocks of total size 292 B
> 0. BP-1285398861-172.31.114.3-1489845696835:blk_1073761924_21129{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ca00550d-702e-4892-a54a-7105af0c19ee:NORMAL:172.31.114.24:20002|RBW],
ReplicaUnderConstruction[[DISK]DS-e0d04bef-a861-40b0-99dd-27bfb2871ecd:NORMAL:172.31.114.27:20002|RBW],
ReplicaUnderConstruction[[DISK]DS-d1979e0c-db81-4790-b225-ae8a4cf42dd8:NORMAL:172.31.114.32:20002|RBW]]}
len=292 MISSING!
> {noformat}
> These blocks need to be recovered by starting a lease recovery process on the NameNode
(which will then run the block recovery as well). This can be triggered programmatically via
the DFSClient.
> Adding this call if the close fails solves the issue.
> cc [~jojochuang]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message