hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ivan Leonardi <ivanleona...@gmail.com>
Subject Re: Reduce task progress at 100% but status is still RUNNING
Date Sat, 18 Dec 2010 15:05:01 GMT
Furthermore, the 'syslog' file of a failing reduce task attempt
doesn't show any warning about it, here is the tail of it:
2010-12-18 15:49:29,773 INFO org.apache.hadoop.mapred.Merger: Down to
the last merge-pass, with 60 segments left of total size: 35288842
bytes
2010-12-18 15:49:31,012 INFO org.apache.hadoop.mapred.ReduceTask:
Merged 60 segments, 35288842 bytes to disk to satisfy reduce memory
limit
2010-12-18 15:49:31,013 INFO org.apache.hadoop.mapred.ReduceTask:
Merging 1 files, 35288768 bytes from disk
2010-12-18 15:49:31,013 INFO org.apache.hadoop.mapred.ReduceTask:
Merging 0 segments, 0 bytes from memory into reduce
2010-12-18 15:49:31,013 INFO org.apache.hadoop.mapred.Merger: Merging
1 sorted segments
2010-12-18 15:49:31,016 INFO org.apache.hadoop.mapred.Merger: Down to
the last merge-pass, with 1 segments left of total size: 35288764
bytes
2010-12-18 15:59:47,795 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=CLEANUP, sessionId=
2010-12-18 15:59:47,899 INFO org.apache.hadoop.mapred.TaskRunner:
Runnning cleanup for the task
2010-12-18 15:59:47,912 INFO org.apache.hadoop.mapred.TaskRunner:
Task:attempt_201012181440_0009_r_000009_0 is done. And is in the
process of commiting
2010-12-18 15:59:47,921 INFO org.apache.hadoop.mapred.TaskRunner: Task
'attempt_201012181440_0009_r_000009_0' done.

2010/12/18 Ivan Leonardi <ivanleonardi@gmail.com>:
> Yes, I'm performing computation on large graphs (3.7M of nodes) and
> set the numReduceTasks to 60. Below there is the code. I'm using a
> grouping function grouper.getGroup(u) that computes the group ID of
> the node having u as ID: the number of groups actually corresponds to
> the numReduceTasks parameter. Am I doing it wrong, maybe? If I use
> e.g. 20 reduce tasks, I got memory problems of Java heap, since a
> reducer task receives all the information about a group (thus in the
> order of 3.7M/${groups_count}).
> /etc/hosts is ok, the hosts knows each other.
>
> public static class IEMapper extends Mapper<LongWritable, Vertex,
> LongWritable, Message> {
>        private Grouper grouper;
>        public void map(LongWritable key, Vertex value, Context context)
> throws IOException, InterruptedException {
>                long u = key.get();
>                long e = value.getEstimate();
>                LongWritable groupId = new LongWritable();
>
>                Message indexMessage = new Message(Global.MSG_INDEX);
>                indexMessage.setPayload(new PairOfLongs(u, e));
>                groupId.set(grouper.getGroup(u));
>                context.write(groupId, indexMessage);
>
>                Message updateMessage = new Message(Global.MSG_UPDATE);
>                updateMessage.setPayload(new TripleOfLongs());
>
>                for (Long v : value.getNeighbors().getAll()) {
>                        groupId.set(grouper.getGroup(v));
>                        updateMessage.getPayload(new TripleOfLongs[0]).set(v,
u, e);
>                        context.write(groupId, updateMessage);
>                        context.progress();
>                }
>        }
> }
>
> public static class IEReducer extends Reducer<LongWritable, Message,
> LongWritable, Vertex> {
>
>        public void reduce(LongWritable key, Iterable<Message> values,
> Context context) throws IOException, InterruptedException {
>                TreeSet<Long> S = new TreeSet<Long>();
>                Hashtable<Long, Set<Long>> D = new Hashtable<Long,
Set<Long>>();
>                Hashtable<Long, Long> est = new Hashtable<Long, Long>();
>
>                long dst, src, u, val;
>                for (Message msg : values) {
>                        switch (msg.getType()) {
>                        case Global.MSG_INDEX:
>                                PairOfLongs iPayload = msg.getPayload(new
PairOfLongs[0]);
>                                u = iPayload.getLeftElement();
>                                val = iPayload.getRightElement();
>                                est.put(u, val);
>                                S.add(u);
>                                break;
>                        case Global.MSG_UPDATE:
>                                TripleOfLongs uPayload = msg.getPayload(new
TripleOfLongs[0]);
>                                dst = uPayload.getLeftElement();
>                                src = uPayload.getCenterElement();
>                                val = uPayload.getRightElement();
>                                est.put(src, val);
>                                Set<Long> s = D.get(dst);
>                                if (s == null) {
>                                        s = new HashSet<Long>();
>                                        D.put(dst, s);
>                                }
>                                s.add(src);
>                                break;
>                        default:
>                                break;
>                        }
>                }
>
>                boolean changed = Global.improveEstimate(est, S);
>
>                LongWritable wDest = new LongWritable();
>                Vertex vertex = new Vertex();
>                for (long s : S) {
>                        wDest.set(s);
>                        vertex.getNeighbors().set(D.get(s));
>                        vertex.setEstimate(est.get(s));
>                        context.write(wDest, vertex);
>                        context.progress();
>                }
>
>                if (changed) {
>                        context.getCounter(Global.COUNTERS_GROUP_NAME,
> Global.COUNTER_UPDATES).increment(1L);
>                }
>        }
> }
>
> 2010/12/18 nitin reddy <chintu.nitin@gmail.com>:
>> Hi Ivan
>>
>> i faced the same thing , try adding in the /etc/hosts the host names
>> with their IP address like
>>
>> <ip_master>  master
>> <ip_master> masterhostname
>> <ip_slave>  slave
>> <ip_slave> slavehostname
>>
>>
>>
>> On Sat, Dec 18, 2010 at 7:54 AM, Ivan Leonardi <ivanleonardi@gmail.com> wrote:
>>> Hi all,
>>> I'm running Hadoop 0.20.2 in a cluster setup built by 4 machines, and
>>> I'm trying to run my own job. Here is my configuration:
>>>
>>> core-site.xml
>>> <configuration>
>>> <property>
>>> <name>fs.default.name</name><value>hdfs://net-server00:44310</value>
>>> </property>
>>> <property>
>>> <name>hadoop.tmp.dir</name><value>/tmp/hadoop-leonardi</value>
>>> </property>
>>> </configuration>
>>>
>>> hdfs-site.xml
>>> <configuration>
>>> <property>
>>> <name>dfs.replication</name><value>1</value>
>>> </property>
>>> </configuration>
>>>
>>> mapred-site.xml
>>> <configuration>
>>> <property>
>>> <name>mapred.job.tracker</name><value>net-server00:44311</value>
>>> </property>
>>> <property>
>>> <name>mapred.tasktracker.map.tasks.maximum</name><value>3</value>
>>> </property>
>>> <property>
>>> <name>mapred.tasktracker.reduce.tasks.maximum</name><value>3</value>
>>> </property>
>>> <property>
>>> <name>tasktracker.http.threads</name><value>60</value>
>>> </property>
>>> <property>
>>> <name>mapred.reduce.parallel.copies</name><value>50</value>
>>> </property>
>>> </configuration>
>>>
>>> My job runs normally at the beginning:
>>> 10/12/18 14:16:36 INFO input.FileInputFormat: Total input paths to process :
30
>>> 10/12/18 14:16:36 INFO mapred.JobClient: Running job: job_201012181359_0004
>>> 10/12/18 14:16:37 INFO mapred.JobClient:  map 0% reduce 0%
>>> 10/12/18 14:16:51 INFO mapred.JobClient:  map 20% reduce 0%
>>> 10/12/18 14:16:54 INFO mapred.JobClient:  map 40% reduce 0%
>>> 10/12/18 14:17:00 INFO mapred.JobClient:  map 60% reduce 0%
>>> 10/12/18 14:17:03 INFO mapred.JobClient:  map 80% reduce 2%
>>> 10/12/18 14:17:06 INFO mapred.JobClient:  map 80% reduce 5%
>>> 10/12/18 14:17:09 INFO mapred.JobClient:  map 90% reduce 7%
>>> 10/12/18 14:17:12 INFO mapred.JobClient:  map 100% reduce 9%
>>> 10/12/18 14:17:15 INFO mapred.JobClient:  map 100% reduce 10%
>>> 10/12/18 14:17:18 INFO mapred.JobClient:  map 100% reduce 11%
>>> 10/12/18 14:17:19 INFO mapred.JobClient:  map 100% reduce 13%
>>> 10/12/18 14:17:21 INFO mapred.JobClient:  map 100% reduce 21%
>>> 10/12/18 14:17:22 INFO mapred.JobClient:  map 100% reduce 23%
>>> 10/12/18 14:17:24 INFO mapred.JobClient:  map 100% reduce 28%
>>> 10/12/18 14:17:27 INFO mapred.JobClient:  map 100% reduce 33%
>>> 10/12/18 14:17:29 INFO mapred.JobClient:  map 100% reduce 35%
>>> 10/12/18 14:17:31 INFO mapred.JobClient:  map 100% reduce 38%
>>> 10/12/18 14:17:34 INFO mapred.JobClient:  map 100% reduce 40%
>>>
>>> at this point, this percentages are blocked for 10 minutes (600 secs).
>>> In the meanwhile I looked at the Task Tracker Status, I see 3 running
>>> reduce tasks in each machine's proper web page having Status=RUNNING
>>> and Progress=100%; here is a sample:
>>> TaskAttempts                           Status    Progress
>>> attempt_201012181359_0004_r_000001_0   RUNNING   100.00%
>>> attempt_201012181359_0004_r_000003_0   RUNNING   100.00%
>>> attempt_201012181359_0004_r_000006_0   RUNNING   100.00%
>>>
>>> Therefore, after 600 secs, the console continues the logging by
>>> emitting the following:
>>> 10/12/18 14:27:34 INFO mapred.JobClient:  map 100% reduce 36%
>>> 10/12/18 14:27:36 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000006_0, Status : FAILED Task
>>> attempt_201012181359_0004_r_000006_0 failed to report status for 602
>>> seconds. Killing!
>>> 10/12/18 14:27:38 INFO mapred.JobClient:  map 100% reduce 23%
>>> 10/12/18 14:27:40 INFO mapred.JobClient:  map 100% reduce 6%
>>> 10/12/18 14:27:40 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000007_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000007_0 failed to report status for
>>> 601 seconds. Killing!
>>> 10/12/18 14:27:40 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000010_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000010_0 failed to report status for
>>> 601 seconds. Killing!
>>> 10/12/18 14:27:41 INFO mapred.JobClient:  map 100% reduce 0%
>>> 10/12/18 14:27:42 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000000_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000000_0 failed to report status for
>>> 600 seconds. Killing!
>>> 10/12/18 14:27:42 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000009_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000009_0 failed to report status for
>>> 600 seconds. Killing!
>>> 10/12/18 14:27:43 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000001_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000001_0 failed to report status for
>>> 601 seconds. Killing!
>>> 10/12/18 14:27:43 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000003_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000003_0 failed to report status for
>>> 602 seconds. Killing!
>>> 10/12/18 14:27:45 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000004_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000004_0 failed to report status for
>>> 601 seconds. Killing!
>>> 10/12/18 14:27:45 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000008_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000008_0 failed to report status for
>>> 601 seconds. Killing!
>>> 10/12/18 14:27:46 INFO mapred.JobClient:  map 100% reduce 1%
>>> 10/12/18 14:27:46 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000011_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000011_0 failed to report status for
>>> 601 seconds. Killing!
>>> 10/12/18 14:27:46 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000002_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000002_0 failed to report status for
>>> 602 seconds. Killing!
>>> 10/12/18 14:27:48 INFO mapred.JobClient: Task Id :
>>> attempt_201012181359_0004_r_000005_0, Status : FAILED
>>> Task attempt_201012181359_0004_r_000005_0 failed to report status for
>>> 602 seconds. Killing!
>>> 10/12/18 14:27:49 INFO mapred.JobClient:  map 100% reduce 2%
>>> 10/12/18 14:27:55 INFO mapred.JobClient:  map 100% reduce 4%
>>> 10/12/18 14:27:56 INFO mapred.JobClient:  map 100% reduce 6%
>>> 10/12/18 14:27:58 INFO mapred.JobClient:  map 100% reduce 9%
>>> 10/12/18 14:28:00 INFO mapred.JobClient:  map 100% reduce 14%
>>> 10/12/18 14:28:02 INFO mapred.JobClient:  map 100% reduce 22%
>>> 10/12/18 14:28:03 INFO mapred.JobClient:  map 100% reduce 25%
>>> 10/12/18 14:28:05 INFO mapred.JobClient:  map 100% reduce 31%
>>> 10/12/18 14:28:06 INFO mapred.JobClient:  map 100% reduce 33%
>>> 10/12/18 14:28:08 INFO mapred.JobClient:  map 100% reduce 36%
>>> 10/12/18 14:28:09 INFO mapred.JobClient:  map 100% reduce 37%
>>> 10/12/18 14:28:11 INFO mapred.JobClient:  map 100% reduce 38%
>>> 10/12/18 14:28:12 INFO mapred.JobClient:  map 100% reduce 39%
>>> 10/12/18 14:28:14 INFO mapred.JobClient:  map 100% reduce 40%
>>>
>>> and such behaviour continues.
>>>
>>> Is there anyone had the same problem and can help me?
>>> Thank you all!
>>>
>>> Ivan
>>>
>>
>

Mime
View raw message