hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-2725) Distcp truncates some files when copying
Date Tue, 12 Feb 2008 22:01:21 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-2725:
----------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Nicholas!

> Distcp truncates some files when copying
> ----------------------------------------
>
>                 Key: HADOOP-2725
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2725
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs, util
>    Affects Versions: 0.16.0
>         Environment: Nightly build: http://hadoopqa.yst.corp.yahoo.com:8080/hudson/job/Hadoop-LinuxTest/770/
> With patches for HADOOP-2095 and HADOOP-2119.
>            Reporter: Murtaza A. Basrai
>            Assignee: Tsz Wo (Nicholas), SZE
>            Priority: Critical
>             Fix For: 0.16.1
>
>         Attachments: 2725_20080206.patch, 2725_20080208.patch, 2725_20080212.patch
>
>
> We used distcp to copy ~100 TB of data between two clusters of ~1400 nodes each.
> Command used (it was run on the src cluster):
> hadoop distcp -log /logdir/logfile hdfs://src-namenode:8600//src-dir-1 hdfs://src-namenode:8600//src-dir-2 ... hdfs://src-namenode:8600//src-dir-n hdfs://tgt-namenode:8600//dst-dir
> Distcp completed without errors, but when we checked the file sizes on the src and tgt clusters, we noticed differences in file sizes for 9 files (~6 GB).
> src-file-1 666762714 bytes -> tgt-file-1 134217728 bytes
> src-file-2 673791814 bytes -> tgt-file-2 536870912 bytes
> src-file-3 692172075 bytes -> tgt-file-3 0 bytes
> All target files are truncated at block boundaries (some have 0 size).
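The block-boundary observation can be checked with simple arithmetic. The sketch below uses only the three size pairs quoted above. The 64 MiB block size is an assumption, since the report does not state the clusters' configured block size; the two non-zero target sizes are also whole multiples of 128 MiB, so the conclusion is the same under that setting.

```python
# Sizes in bytes copied from the report: (source size, target size).
REPORTED = [
    (666762714, 134217728),
    (673791814, 536870912),
    (692172075, 0),
]

# Assumption: 64 MiB blocks. The report does not state the block size.
ASSUMED_BLOCK_SIZE = 64 * 1024 * 1024


def is_block_aligned(size, block_size=ASSUMED_BLOCK_SIZE):
    """Return True if size is a whole number of blocks (0 counts as aligned)."""
    return size % block_size == 0


def whole_blocks(size, block_size=ASSUMED_BLOCK_SIZE):
    """Return how many complete blocks fit in size."""
    return size // block_size


for src, tgt in REPORTED:
    print(
        f"src={src} aligned={is_block_aligned(src)} | "
        f"tgt={tgt} aligned={is_block_aligned(tgt)} "
        f"blocks={whole_blocks(tgt)} missing={src - tgt}"
    )
```

Under this assumption every target size is block-aligned while no source size is, which is consistent with a copy that stopped after a whole number of blocks.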
> I looked at the log files, and noticed a few things:
> 1. There are 31059 log files (same as the number of Maps the job had).
> 2. 246 of the log files are non-empty.
> 3. All non-empty log files are of the form:
> SKIP: hdfs://src-namenode/src-dir-a/src-file-x
> SKIP: hdfs://src-namenode/src-dir-b/src-file-y
> SKIP: hdfs://src-namenode/src-dir-c/src-file-z
> 4. All 9 files which were truncated were included in the log files as skipped files.
> 5. All 9 files were the last entry in their respective log files.
> e.g.
> Non-empty logfile 1:
> SKIP: hdfs://src-namenode/src-dir-a/src-file-x
> SKIP: hdfs://src-namenode/src-dir-b/src-file-y
> SKIP: hdfs://src-namenode/src-dir-c/src-file-z  <-- Truncated file
> Non-empty logfile 2:
> SKIP: hdfs://src-namenode/src-dir-p/src-file-m
> SKIP: hdfs://src-namenode/src-dir-q/src-file-n  <-- Truncated file
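Observations 4 and 5 suggest a way to shortlist files for a size check: take the last SKIP entry of each non-empty log. The sketch below is illustrative only. It assumes each log is plain text with one "SKIP: <path>" entry per line, as in the excerpts above; the log names in SAMPLE_LOGS are hypothetical, and the paths are the placeholders from the report.

```python
SKIP_PREFIX = "SKIP: "


def last_skipped(log_text):
    """Return the path on the final SKIP line of log_text, or None if there is none."""
    last = None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(SKIP_PREFIX):
            last = line[len(SKIP_PREFIX):].strip()
    return last


def candidates(logs):
    """Map each log that contains SKIP lines to its last skipped path."""
    found = {}
    for name, text in logs.items():
        path = last_skipped(text)
        if path is not None:
            found[name] = path
    return found


# Hypothetical log names; contents mirror the two examples in the report.
SAMPLE_LOGS = {
    "logfile-1": (
        "SKIP: hdfs://src-namenode/src-dir-a/src-file-x\n"
        "SKIP: hdfs://src-namenode/src-dir-b/src-file-y\n"
        "SKIP: hdfs://src-namenode/src-dir-c/src-file-z\n"
    ),
    "logfile-2": (
        "SKIP: hdfs://src-namenode/src-dir-p/src-file-m\n"
        "SKIP: hdfs://src-namenode/src-dir-q/src-file-n\n"
    ),
    "logfile-3": "",  # empty log, like most of the 31059 in the report
}

for name, path in sorted(candidates(SAMPLE_LOGS).items()):
    print(f"{name}: compare src and tgt sizes for {path}")
```

The result is a list of candidates, not confirmed truncations: the report counts 246 non-empty logs but only 9 truncated files, so each candidate still needs its source and target sizes compared.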

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

