hadoop-common-dev mailing list archives

From "Ravi Gummadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6072) distcp should place the file distcp_src_files in distributed cache
Date Thu, 18 Jun 2009 18:52:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721411#action_12721411 ]

Ravi Gummadi commented on HADOOP-6072:

In general, I think this file, distcp_src_files, would not be large enough to occupy many HDFS blocks.

With thousands of nodes in a cluster (say 4000), even the square root of getMaxMapTasks() would
be 89 (i.e. sqrt(8000)), which is a large number for replication. Is that still OK for the
namenode's performance with many distcp jobs running in parallel, each creating this file with
that many replicas?
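
To make the arithmetic concrete, here is a minimal Java sketch of the estimate. It is not DistCp code: the class and method names are hypothetical, the figure of 8000 map slots is taken from the example above (which would correspond to 2 map slots per node on 4000 nodes, an assumption), and the job count and the single block per file are made up for illustration.

```java
public class ReplicationEstimate {

    // Hypothetical helper: replication factor chosen as floor(sqrt(maxMapTasks)),
    // never less than 1.
    static int sqrtReplication(int maxMapTasks) {
        return Math.max(1, (int) Math.sqrt(maxMapTasks));
    }

    // Number of block replicas the namenode would have to track if `jobs`
    // distcp jobs each created one such file of `blocksPerFile` blocks.
    static long totalReplicas(int jobs, int blocksPerFile, int replication) {
        return (long) jobs * blocksPerFile * replication;
    }

    public static void main(String[] args) {
        int maxMapTasks = 8000;   // value used in the comment above
        int concurrentJobs = 50;  // assumed, for illustration only
        int blocksPerFile = 1;    // assumed: the file fits in one HDFS block

        int replication = sqrtReplication(maxMapTasks);
        System.out.println("replication per file = " + replication);
        System.out.println("replicas tracked     = "
                + totalReplicas(concurrentJobs, blocksPerFile, replication));
    }
}
```

With these inputs the per-file replication comes out at 89, so the namenode load grows linearly with the number of concurrent distcp jobs.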

> distcp should place the file distcp_src_files in distributed cache
> ------------------------------------------------------------------
>                 Key: HADOOP-6072
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6072
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 0.21.0
>            Reporter: Ravi Gummadi
>             Fix For: 0.21.0
> When a large number of files are being copied by distcp, accessing distcp_src_files seems
> to be an issue, as all map tasks would be accessing this file. The error message seen is:
> 09/06/16 10:13:16 INFO mapred.JobClient: Task Id : attempt_200906040559_0110_m_003348_0, Status : FAILED
> java.io.IOException: Could not obtain block: blk_-4229860619941366534_1500174
> file=/mapredsystem/hadoop/mapredsystem/distcp_7fiyvq/_distcp_src_files
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1757)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1585)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1712)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1450)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1428)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
>         at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
>         at org.apache.hadoop.tools.DistCp$CopyInputFormat.getRecordReader(DistCp.java:299)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:336)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>         at org.apache.hadoop.mapred.Child.main(Child.java:170)
> This could be because of HADOOP-6038 and/or HADOOP-4681.
> If distcp places this special file distcp_src_files in the distributed cache, that could
> solve the problem.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
