hadoop-common-dev mailing list archives

From "Sameer Paranjpye (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2725) Distcp truncates some files when copying
Date Wed, 06 Feb 2008 19:21:09 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566284#action_12566284 ]

Sameer Paranjpye commented on HADOOP-2725:

> Shall we make "-update" a default option? It is like cp in unix, i.e. cp overwrites files by default.

I'd rather not make it the default. It's too easy to clobber vast amounts of data inadvertently.

From the discussion above, it seems like the problem is that partial copies aren't clearly
distinguishable from successfully copied inputs. One has to compare the source and destination
lists by name and size to determine the set of unsuccessful copies. The use of temporary filenames
should make it easier to find partial copies.
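
The name-and-size comparison can be sketched roughly as follows. This is only an illustration: it assumes the two listings have already been collected into dictionaries mapping a path relative to the copy root to a size in bytes, and the function name and sample paths are hypothetical. The first two source sizes and the one target size are taken from the issue description.

```python
def find_unsuccessful_copies(src_sizes, dst_sizes):
    """Return {path: (src_size, dst_size)} for files that are missing or differ in size.

    dst_size is None when the file does not exist at the destination.
    """
    suspect = {}
    for path, src_size in src_sizes.items():
        dst_size = dst_sizes.get(path)
        if dst_size != src_size:
            suspect[path] = (src_size, dst_size)
    return suspect


# Hypothetical listings keyed by path relative to the copy root.
src = {"dir-1/file-1": 666762714, "dir-1/file-2": 673791814, "dir-2/file-3": 1024}
dst = {"dir-1/file-1": 134217728, "dir-2/file-3": 1024}

for path, (s, d) in sorted(find_unsuccessful_copies(src, dst).items()):
    print(path, s, "->", "missing" if d is None else d)
```

A size match does not prove that the contents match; it only rules out the truncation case discussed here.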

Another enhancement that would help is distcp deleting partially copied files at the destination.
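
To make the two ideas concrete, here is a minimal sketch of copying under a temporary name and deleting the partial file on failure. It works on the local filesystem only and is not how distcp is implemented; the function name and the ".partial" suffix are assumptions made for illustration.

```python
import os
import shutil
import tempfile


def copy_via_temp_name(src_path, dst_path, suffix=".partial"):
    """Copy src_path to dst_path so that dst_path only ever holds a complete copy."""
    tmp_path = dst_path + suffix
    try:
        with open(src_path, "rb") as src, open(tmp_path, "wb") as tmp:
            shutil.copyfileobj(src, tmp)
        if os.path.getsize(tmp_path) != os.path.getsize(src_path):
            raise IOError("short copy of %s" % src_path)
        os.replace(tmp_path, dst_path)  # publish under the final name
    except BaseException:
        # Delete the partial copy rather than leaving it at the destination.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Small demonstration in a scratch directory.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "src-file")
target = os.path.join(workdir, "dst-file")
with open(source, "wb") as f:
    f.write(b"x" * 4096)
copy_via_temp_name(source, target)
print(os.path.getsize(target), os.path.exists(target + ".partial"))
```

With this scheme, anything still carrying the temporary suffix at the destination is by construction an incomplete copy, so it can be found by name alone.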

> Distcp truncates some files when copying
> ----------------------------------------
>                 Key: HADOOP-2725
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2725
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs, util
>    Affects Versions: 0.16.0
>         Environment: Nightly build: http://hadoopqa.yst.corp.yahoo.com:8080/hudson/job/Hadoop-LinuxTest/770/
> With patches for HADOOP-2095 and HADOOP-2119.
>            Reporter: Murtaza A. Basrai
>            Assignee: Tsz Wo (Nicholas), SZE
>            Priority: Critical
>             Fix For: 0.16.1
> We used distcp to copy ~100 TB of data between two clusters of ~1400 nodes each.
> Command used (it was run on the src cluster):
> hadoop distcp -log /logdir/logfile hdfs://src-namenode:8600//src-dir-1 hdfs://src-namenode:8600//src-dir-2 ... hdfs://src-namenode:8600//src-dir-n hdfs://tgt-namenode:8600//dst-dir
> Distcp completed without errors, but when we checked the file sizes on the src and tgt clusters, we noticed differences in file sizes for 9 files (~6 GB).
> src-file-1 666762714 bytes -> tgt-file-1 134217728 bytes
> src-file-2 673791814 bytes -> tgt-file-2 536870912 bytes
> src-file-3 692172075 bytes -> tgt-file-3 0 bytes
> All target files are truncated at block boundaries (some have 0 size).
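
The block-boundary observation can be checked arithmetically. The sketch below assumes a block size of 64 MiB (67108864 bytes); the report does not state the clusters' block size, so that value and the function name are assumptions. Under that assumption all three target sizes are whole multiples of the block size, while none of the source sizes is.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed; not stated in the report


def ends_on_block_boundary(size, block_size=BLOCK_SIZE):
    """True if a file of this size ends exactly at a block boundary (including size 0)."""
    return size % block_size == 0


# (source size, target size) pairs quoted in the report.
pairs = [
    (666762714, 134217728),
    (673791814, 536870912),
    (692172075, 0),
]

for src_size, tgt_size in pairs:
    print(src_size, ends_on_block_boundary(src_size),
          tgt_size, ends_on_block_boundary(tgt_size),
          tgt_size // BLOCK_SIZE)
```
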
> I looked at the log files, and noticed a few things:
> 1. There are 31059 log files (same as the number of Maps the job had).
> 2. 246 of the log files are non-empty.
> 3. All non-empty log files are of the form:
> SKIP: hdfs://src-namenode/src-dir-a/src-file-x
> SKIP: hdfs://src-namenode/src-dir-b/src-file-y
> SKIP: hdfs://src-namenode/src-dir-c/src-file-z
> 4. All 9 files which were truncated were included in the log files as skipped files.
> 5. All 9 files were the last entry in their respective log files.
> e.g.
> Non-empty logfile 1:
> SKIP: hdfs://src-namenode/src-dir-a/src-file-x
> SKIP: hdfs://src-namenode/src-dir-b/src-file-y
> SKIP: hdfs://src-namenode/src-dir-c/src-file-z  <-- Truncated file
> Non-empty logfile 2:
> SKIP: hdfs://src-namenode/src-dir-p/src-file-m
> SKIP: hdfs://src-namenode/src-dir-q/src-file-n  <-- Truncated file
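
Since every truncated file was the last entry in its log, the final SKIP line of each non-empty log gives a list of candidates to check. A rough sketch, assuming the log contents have already been read into strings (the function name is hypothetical, and the sample text is the example from the report):

```python
def last_skipped_entries(log_texts):
    """For each non-empty log, return the path on its final 'SKIP:' line."""
    candidates = []
    for text in log_texts:
        skipped = [line.split("SKIP:", 1)[1].strip()
                   for line in text.splitlines()
                   if line.startswith("SKIP:")]
        if skipped:
            candidates.append(skipped[-1])
    return candidates


logs = [
    "SKIP: hdfs://src-namenode/src-dir-a/src-file-x\n"
    "SKIP: hdfs://src-namenode/src-dir-b/src-file-y\n"
    "SKIP: hdfs://src-namenode/src-dir-c/src-file-z\n",
    "",  # most map logs were empty
    "SKIP: hdfs://src-namenode/src-dir-p/src-file-m\n"
    "SKIP: hdfs://src-namenode/src-dir-q/src-file-n\n",
]
print(last_skipped_entries(logs))
```

These are candidates only: 246 logs were non-empty but just 9 files were truncated, so each candidate still needs its size compared against the source.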

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
