hadoop-hdfs-dev mailing list archives

From "Mark Christiaens (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-13604) DistCp filtering conflicts with snapshotting
Date Wed, 23 May 2018 09:06:00 GMT
Mark Christiaens created HDFS-13604:
---------------------------------------

             Summary: DistCp filtering conflicts with snapshotting
                 Key: HDFS-13604
                 URL: https://issues.apache.org/jira/browse/HDFS-13604
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: distcp
            Reporter: Mark Christiaens


DistCp has an option to filter (not copy) files that match one of the patterns listed in a filter file.
DistCp also has options that optimize incremental copying based on snapshots present
at the source and target locations.  When both options are enabled, files that should be copied
from source to target are missing on the target.

To reproduce the issue:
 * Create two directories, {{source}} and {{target}}.
 * In {{source}}, put two files, {{A}} and {{B}}, with some random content.
 * Create a filter file that filters {{A}} (that is, prevents {{A}} from being copied).
 * Create a snapshot, {{snapshot_old}}, of the {{source}} directory.
 * Use {{distcp}} to copy the content of {{source}} to {{target}}.
 * As expected, the {{target}} directory will contain only file {{B}}.  {{A}} is filtered.
 * Take a snapshot of the {{target}} directory, {{snapshot_old}}.
 * In the {{source}} directory, rename {{A}} to {{C}}.
 * Take a new snapshot of the source directory, {{snapshot_new}}.
 * Now, perform an incremental {{distcp}} copy using the created snapshots so as to optimize
the incremental copy process: {{distcp -update -filters filters.txt -diff snapshot_old snapshot_new ... ...}}
 * You will find that the file {{C}} (the renamed {{A}}) is not copied to the {{target}} directory.
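
The steps above can be written out as concrete commands. The following is only a sketch: it builds the command lines in Python without running them. The paths ({{/tmp/source}}, {{/tmp/target}}), the local file names and the helper {{repro_commands}} are hypothetical, the {{-allowSnapshot}} calls are an assumed prerequisite for taking snapshots, and using {{-update}} on the first copy is my assumption so that the content of {{source}} lands directly in {{target}}.

```python
import shlex


def repro_commands(source, target, local_a, local_b, filters_file):
    """Return the reproduction steps as argv lists, in order.

    All arguments are placeholders supplied by the caller; nothing is executed.
    """
    return [
        ["hdfs", "dfs", "-mkdir", "-p", source, target],
        ["hdfs", "dfs", "-put", local_a, f"{source}/A"],
        ["hdfs", "dfs", "-put", local_b, f"{source}/B"],
        # Assumed prerequisite: directories must be snapshottable.
        ["hdfs", "dfsadmin", "-allowSnapshot", source],
        ["hdfs", "dfsadmin", "-allowSnapshot", target],
        ["hdfs", "dfs", "-createSnapshot", source, "snapshot_old"],
        # Initial filtered copy: only B should arrive on the target.
        ["hadoop", "distcp", "-update", "-filters", filters_file, source, target],
        ["hdfs", "dfs", "-createSnapshot", target, "snapshot_old"],
        ["hdfs", "dfs", "-mv", f"{source}/A", f"{source}/C"],
        ["hdfs", "dfs", "-createSnapshot", source, "snapshot_new"],
        # Incremental copy that combines -filters with -diff.
        ["hadoop", "distcp", "-update", "-filters", filters_file,
         "-diff", "snapshot_old", "snapshot_new", source, target],
        # Inspect the result: C is expected to be missing.
        ["hdfs", "dfs", "-ls", target],
    ]


commands = repro_commands("/tmp/source", "/tmp/target",
                          "A.bin", "B.bin", "filters.txt")
# One shell-quoted command per line, ready to paste into a terminal.
script = "\n".join(shlex.join(c) for c in commands)
```

Printing {{script}} gives a shell script for a test cluster. The filter file itself would contain one pattern per line that matches the path of {{A}}.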

I suspect that the reason for this is that {{distcp}} concludes from analyzing the difference
between {{snapshot_old}} and {{snapshot_new}} on the source that {{A}} was renamed to {{C}}. This
can be confirmed by using {{snapshotDiff}} to compare the two snapshots: it reports that
{{A}} has been renamed to {{C}}.

{{distcp}} then seems to assume that the data for {{C}} is already present in the {{target}}
directory and only needs to be renamed.  However, due to the filtering, {{A}} is _not_ present
on the target and cannot be renamed to {{C}}.
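
To make the suspected mechanism concrete, here is a toy Python model. It is not DistCp's actual code, only an illustration of the behaviour I think I am seeing. Snapshots are modelled as maps from path to a file id, filters as regular expressions applied to the bare file name, and all function names are made up for this example.

```python
import re


def is_filtered(path, filters):
    """True if any filter pattern matches the path."""
    return any(re.search(pattern, path) for pattern in filters)


def full_copy(source_snapshot, filters):
    """Initial copy: every path that is not filtered ends up on the target."""
    return {path for path in source_snapshot if not is_filtered(path, filters)}


def snapshot_renames(old, new):
    """Renames between two snapshots: same file id, different path."""
    old_by_id = {file_id: path for path, file_id in old.items()}
    return [(old_by_id[file_id], path)
            for path, file_id in new.items()
            if file_id in old_by_id and old_by_id[file_id] != path]


def diff_sync(target, renames):
    """Apply renames on the target, assuming the old path is already there.

    If the old path is absent, nothing happens and no error is raised.
    """
    synced = set(target)
    for old_path, new_path in renames:
        if old_path in synced:
            synced.remove(old_path)
            synced.add(new_path)
    return synced


filters = [r"^A$"]                      # filter out file A
snapshot_old = {"A": 1, "B": 2}         # source before the rename
snapshot_new = {"C": 1, "B": 2}         # source after renaming A to C

target = full_copy(snapshot_old, filters)
renames = snapshot_renames(snapshot_old, snapshot_new)
target_after = diff_sync(target, renames)
```

In this model {{target_after}} still contains only {{B}}: the rename of {{A}} to {{C}} is a no-op on the target because {{A}} was never copied, and nothing signals that {{C}} is missing.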

Although the final {{distcp}} fails to create a copy of the {{C}} file in the {{target}} directory,
{{distcp}} does not report any failure, nor can I find any trace of errors in the job logs
of the jobs created by {{distcp}} to execute the actual copy.

So, some options:
 * Combining {{-diff}} and {{-filters}} could be disallowed.
 * {{distcp}} could assume that files that have been filtered are _not_ present and should
be replicated in ordinary fashion.
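
Both options can be sketched in a few lines of Python. This is a hypothetical illustration under the same simplifications as the report (filters as regular expressions on the file name), not a patch against DistCp, and the names {{check_options}} and {{plan_sync}} are invented here.

```python
import re


def is_filtered(path, filters):
    """True if any filter pattern matches the path."""
    return any(re.search(pattern, path) for pattern in filters)


def check_options(use_diff, filters):
    """Option 1: refuse to combine a snapshot diff with filters."""
    if use_diff and filters:
        raise ValueError("-diff cannot be combined with -filters")


def plan_sync(renames, filters):
    """Option 2: treat a rename from a filtered path as an ordinary copy.

    Returns (renames to apply on the target, new paths to copy from source).
    The case where only the new path is filtered is not handled here.
    """
    to_rename, to_copy = [], []
    for old_path, new_path in renames:
        if is_filtered(old_path, filters):
            # The old path was never copied, so there is nothing to rename.
            if not is_filtered(new_path, filters):
                to_copy.append(new_path)
        else:
            to_rename.append((old_path, new_path))
    return to_rename, to_copy


to_rename, to_copy = plan_sync([("A", "C"), ("B", "D")], [r"^A$"])
```

With the filter on {{A}}, the rename of {{A}} to {{C}} becomes a copy of {{C}}, while the rename of {{B}} to {{D}} is still applied as a rename on the target.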

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org

