hadoop-hdfs-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-13660) DistCp job fails when new data is appended in the file while the distCp copy job is running
Date Fri, 08 Jun 2018 18:10:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-13660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16506347#comment-16506347 ]

Steve Loughran commented on HDFS-13660:

Interesting. But at least it failed. A bigger risk would be the file being replaced by a
new file of the same size: if the read crossed a block boundary, you could end up with a
mix of the old and new data. You'd be hard pressed to safely identify the problem, other than
by comparing the source checksum taken before the upload began with the source checksum
taken after it had finished.

# I think the first step here would be to document what you must not do while an upload is
in progress: append to or replace files.
# Longer term: after an upload, detect whether the source has changed; if it has, warn and
maybe repeat the upload. That'd be with a checksum on HDFS, modified timestamp elsewhere.
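
A rough sketch of the longer-term idea, using local files and the Python standard library rather than HDFS or DistCp itself. The names `SourceSnapshot`, `snapshot` and `copy_with_source_check` are hypothetical, and the SHA-256 digest only stands in for whatever checksum the source filesystem can provide; on a store without checksums the comparison would presumably fall back to length and modified timestamp alone.

```python
import hashlib
import os
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSnapshot:
    """State of the source file used to detect a change (illustrative only)."""
    length: int
    mtime_ns: int
    digest: str


def snapshot(path):
    """Record length, modification time and a SHA-256 digest of a local file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    st = os.stat(path)
    return SourceSnapshot(st.st_size, st.st_mtime_ns, h.hexdigest())


def copy_with_source_check(src, dst, copy=shutil.copyfile, max_attempts=3):
    """Copy src to dst, repeating the copy if the source changed meanwhile.

    Returns the number of attempts used; raises RuntimeError if the source
    was still changing after max_attempts copies.
    """
    for attempt in range(1, max_attempts + 1):
        before = snapshot(src)
        copy(src, dst)
        after = snapshot(src)
        if before == after:
            return attempt
        print(f"warning: {src} changed during copy (attempt {attempt})")
    raise RuntimeError(f"{src} changed during each of {max_attempts} attempts")
```

Comparing the digest as well as the length is what would catch the same-size replacement case; a length check alone would not. The cost is reading the source twice more per copy, which is why a checksum already maintained by the filesystem would be preferable where one exists.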

> DistCp job fails when new data is appended in the file while the distCp copy job is running
> -------------------------------------------------------------------------------------------
>                 Key: HDFS-13660
>                 URL: https://issues.apache.org/jira/browse/HDFS-13660
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: distcp
>            Reporter: Mukund Thakur
>            Assignee: Mukund Thakur
>            Priority: Critical
>         Attachments: distcp_failure_when_file_append.log
> Steps to reproduce: 
> Suppose distcp MR job is copying the file /tmp/web_returns_merged/data-m-002 and 
> we append some more data to this file using command 
> hadoop fs -appendToFile xaa  /tmp/web_returns_merged/data-m-002
> the job fails with exception 
>  Mismatch in length of source:hdfs://mycluster0/tmp/web_returns_merged/data-m-002 and
> Attached the logs.

