flume-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLUME-2922) HDFSSequenceFile Should Sync Writer
Date Thu, 09 Jun 2016 19:51:21 GMT

    [ https://issues.apache.org/jira/browse/FLUME-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323202#comment-15323202 ]

ASF GitHub Bot commented on FLUME-2922:

GitHub user kevinconaway opened a pull request:


    FLUME-2922 Sync SequenceFile.Writer before calling hflush

    @harishreedharan will you please review?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kevinconaway/flume flume-2922

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #52

commit f03a2406bf44a8300522c1941293e2d74df88d28
Author: Kevin Conaway <kevin.conaway@walmart.com>
Date:   2016-06-09T19:50:13Z

    FLUME-2922 Sync SequenceFile.Writer before calling hflush


> HDFSSequenceFile Should Sync Writer
> -----------------------------------
>                 Key: FLUME-2922
>                 URL: https://issues.apache.org/jira/browse/FLUME-2922
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v1.6.0
>            Reporter: Kevin Conaway
>            Priority: Critical
> There is a possibility of losing data with the current HDFS sequence file writer.
> Internally, the `SequenceFile.Writer` buffers data and periodically syncs it to the
> underlying output stream.  The mechanism for doing this depends on whether compression
> is in use, but in both scenarios the key/value pairs are appended to an internal buffer
> and only flushed to disk after the buffer reaches a certain size.
> Thus it is quite possible for Flume to lose messages if the agent crashes or is stopped
> before the internal buffer is flushed to disk.
> The correct action is to force the writer to sync its internal buffers to the underlying
> `FSDataOutputStream` before calling hflush/sync.
> Additionally, I believe we should be calling hsync instead of hflush.  It's my understanding
> that writes with hsync should be more durable, which I believe is the semantics we want here.
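
The ordering problem described in the issue can be illustrated with a small, self-contained model. This is only a sketch and does not use Hadoop or Flume code: `RecordWriter` is a hypothetical stand-in for a writer that buffers internally (as the issue describes `SequenceFile.Writer` doing), a `ByteArrayOutputStream` stands in for the `FSDataOutputStream`, and its `flush()` stands in for hflush/hsync. The 1024-byte threshold and the record contents are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Toy model of a buffering writer layered over an output stream (not Hadoop code). */
public class BufferedRecordWriterSketch {

    /** Hypothetical writer: holds records in memory until the buffer fills or sync() is called. */
    static final class RecordWriter {
        private final OutputStream out;
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        private final int threshold;

        RecordWriter(OutputStream out, int threshold) {
            this.out = out;
            this.threshold = threshold;
        }

        /** Appends a record to the internal buffer; pushes downstream only once the threshold is reached. */
        void append(String record) throws IOException {
            buffer.write(record.getBytes(StandardCharsets.UTF_8));
            if (buffer.size() >= threshold) {
                sync();
            }
        }

        /** Forces everything buffered so far into the underlying stream. */
        void sync() throws IOException {
            buffer.writeTo(out);
            buffer.reset();
        }
    }

    /** Writes two small records, flushes the underlying stream, and reports how many bytes reached it. */
    public static int bytesVisibleAfterFlush(boolean syncWriterFirst) throws IOException {
        ByteArrayOutputStream downstream = new ByteArrayOutputStream(); // stands in for FSDataOutputStream
        RecordWriter writer = new RecordWriter(downstream, 1024);       // assumed buffer threshold
        writer.append("event-1\n");
        writer.append("event-2\n");
        if (syncWriterFirst) {
            writer.sync();      // the step the issue says is missing
        }
        downstream.flush();     // stands in for hflush/hsync on the stream
        return downstream.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("flush only:       " + bytesVisibleAfterFlush(false) + " bytes");
        System.out.println("sync, then flush: " + bytesVisibleAfterFlush(true) + " bytes");
    }
}
```

In this model, flushing the stream alone leaves both records in the writer's buffer (0 bytes downstream), while syncing the writer first makes all 16 bytes available to the flush. The model says nothing about hflush versus hsync durability; it only shows why the writer's buffer has to be drained before either call can help.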

This message was sent by Atlassian JIRA
