hadoop-mapreduce-user mailing list archives

From Ivan Leonardi <ivanleona...@gmail.com>
Subject Re: Reduce task progress at 100% but status is still RUNNING
Date Sat, 18 Dec 2010 14:58:08 GMT
Yes, I'm performing computation on large graphs (3.7M nodes) and set
numReduceTasks to 60. The code is below. I'm using a grouping function
grouper.getGroup(u) that computes the group ID of the node whose ID is
u; the number of groups corresponds exactly to the numReduceTasks
parameter. Am I doing it wrong, maybe? If I use fewer reduce tasks,
e.g. 20, I get Java heap space problems, since each reduce task
receives all the information about one group (thus on the order of
3.7M/${groups_count} nodes).
/etc/hosts is OK; the hosts know each other.
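
(For illustration only: a minimal sketch of the kind of grouping
function described above, assuming a hypothetical Grouper interface
with the getGroup(long) method that the mapper below calls; the actual
implementation is not shown in this message.)

public class ModuloGrouper implements Grouper {
	// Hypothetical: number of groups, equal to numReduceTasks (e.g. 60).
	private final long groupCount;

	public ModuloGrouper(long groupCount) {
		this.groupCount = groupCount;
	}

	public long getGroup(long nodeId) {
		// Simple modulo assignment: every node with the same group ID
		// ends up in the same reduce call.
		return nodeId % groupCount;
	}
}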

public static class IEMapper extends Mapper<LongWritable, Vertex, LongWritable, Message> {
	private Grouper grouper;

	public void map(LongWritable key, Vertex value, Context context)
			throws IOException, InterruptedException {
		long u = key.get();
		long e = value.getEstimate();
		LongWritable groupId = new LongWritable();

		// Index message: carries (u, estimate of u) to u's own group.
		Message indexMessage = new Message(Global.MSG_INDEX);
		indexMessage.setPayload(new PairOfLongs(u, e));
		groupId.set(grouper.getGroup(u));
		context.write(groupId, indexMessage);

		// Update messages: one (v, u, estimate of u) per neighbor v,
		// sent to v's group.
		Message updateMessage = new Message(Global.MSG_UPDATE);
		updateMessage.setPayload(new TripleOfLongs());

		for (Long v : value.getNeighbors().getAll()) {
			groupId.set(grouper.getGroup(v));
			updateMessage.getPayload(new TripleOfLongs[0]).set(v, u, e);
			context.write(groupId, updateMessage);
			context.progress();
		}
	}
}
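
(Note: the grouper field above is never initialized in this snippet;
presumably that happens in a setup() override that got trimmed out. A
minimal sketch of what it might look like, assuming the hypothetical
ModuloGrouper above and an invented configuration key:)

	protected void setup(Context context) throws IOException, InterruptedException {
		// Hypothetical: read the group count from the job configuration.
		// The key name "iem.groups" is made up for illustration only.
		long groups = context.getConfiguration().getLong("iem.groups", 60L);
		grouper = new ModuloGrouper(groups);
	}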

public static class IEReducer extends Reducer<LongWritable, Message, LongWritable, Vertex> {

	public void reduce(LongWritable key, Iterable<Message> values, Context context)
			throws IOException, InterruptedException {
		// S: node IDs owned by this group; D: adjacency (dst -> sources);
		// est: current estimate per node.
		TreeSet<Long> S = new TreeSet<Long>();
		Hashtable<Long, Set<Long>> D = new Hashtable<Long, Set<Long>>();
		Hashtable<Long, Long> est = new Hashtable<Long, Long>();

		long dst, src, u, val;
		for (Message msg : values) {
			switch (msg.getType()) {
			case Global.MSG_INDEX:
				// (u, estimate) for a node owned by this group.
				PairOfLongs iPayload = msg.getPayload(new PairOfLongs[0]);
				u = iPayload.getLeftElement();
				val = iPayload.getRightElement();
				est.put(u, val);
				S.add(u);
				break;
			case Global.MSG_UPDATE:
				// (dst, src, estimate) sent from a neighboring node.
				TripleOfLongs uPayload = msg.getPayload(new TripleOfLongs[0]);
				dst = uPayload.getLeftElement();
				src = uPayload.getCenterElement();
				val = uPayload.getRightElement();
				est.put(src, val);
				Set<Long> s = D.get(dst);
				if (s == null) {
					s = new HashSet<Long>();
					D.put(dst, s);
				}
				s.add(src);
				break;
			default:
				break;
			}
		}

		boolean changed = Global.improveEstimate(est, S);

		// Emit one Vertex per node of the group with its updated
		// neighbor set and estimate.
		LongWritable wDest = new LongWritable();
		Vertex vertex = new Vertex();
		for (long s : S) {
			wDest.set(s);
			vertex.getNeighbors().set(D.get(s));
			vertex.setEstimate(est.get(s));
			context.write(wDest, vertex);
			context.progress();
		}

		if (changed) {
			context.getCounter(Global.COUNTERS_GROUP_NAME, Global.COUNTER_UPDATES).increment(1L);
		}
	}
}
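
(An aside on the "failed to report status for 600 seconds. Killing!"
messages in the quoted log below: that interval is the
mapred.task.timeout property, in milliseconds, default 600000. If the
reduce tasks legitimately need longer stretches without reporting, one
possible workaround, sketched here and not tested on this job, is to
raise it when setting up the job:)

	// Assumed job-setup code; "job" is the org.apache.hadoop.mapreduce.Job
	// being configured. Raise the task timeout from 10 to 30 minutes.
	job.getConfiguration().setLong("mapred.task.timeout", 30L * 60L * 1000L);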

2010/12/18 nitin reddy <chintu.nitin@gmail.com>:
> Hi Ivan
>
> I faced the same thing; try adding to /etc/hosts the host names
> with their IP addresses, like:
>
> <ip_master>  master
> <ip_master> masterhostname
> <ip_slave>  slave
> <ip_slave> slavehostname
>
>
>
> On Sat, Dec 18, 2010 at 7:54 AM, Ivan Leonardi <ivanleonardi@gmail.com> wrote:
>> Hi all,
>> I'm running Hadoop 0.20.2 on a cluster of 4 machines, and I'm trying
>> to run my own job. Here is my configuration:
>>
>> core-site.xml
>> <configuration>
>> <property>
>> <name>fs.default.name</name><value>hdfs://net-server00:44310</value>
>> </property>
>> <property>
>> <name>hadoop.tmp.dir</name><value>/tmp/hadoop-leonardi</value>
>> </property>
>> </configuration>
>>
>> hdfs-site.xml
>> <configuration>
>> <property>
>> <name>dfs.replication</name><value>1</value>
>> </property>
>> </configuration>
>>
>> mapred-site.xml
>> <configuration>
>> <property>
>> <name>mapred.job.tracker</name><value>net-server00:44311</value>
>> </property>
>> <property>
>> <name>mapred.tasktracker.map.tasks.maximum</name><value>3</value>
>> </property>
>> <property>
>> <name>mapred.tasktracker.reduce.tasks.maximum</name><value>3</value>
>> </property>
>> <property>
>> <name>tasktracker.http.threads</name><value>60</value>
>> </property>
>> <property>
>> <name>mapred.reduce.parallel.copies</name><value>50</value>
>> </property>
>> </configuration>
>>
>> My job runs normally at the beginning:
>> 10/12/18 14:16:36 INFO input.FileInputFormat: Total input paths to process : 30
>> 10/12/18 14:16:36 INFO mapred.JobClient: Running job: job_201012181359_0004
>> 10/12/18 14:16:37 INFO mapred.JobClient:  map 0% reduce 0%
>> 10/12/18 14:16:51 INFO mapred.JobClient:  map 20% reduce 0%
>> 10/12/18 14:16:54 INFO mapred.JobClient:  map 40% reduce 0%
>> 10/12/18 14:17:00 INFO mapred.JobClient:  map 60% reduce 0%
>> 10/12/18 14:17:03 INFO mapred.JobClient:  map 80% reduce 2%
>> 10/12/18 14:17:06 INFO mapred.JobClient:  map 80% reduce 5%
>> 10/12/18 14:17:09 INFO mapred.JobClient:  map 90% reduce 7%
>> 10/12/18 14:17:12 INFO mapred.JobClient:  map 100% reduce 9%
>> 10/12/18 14:17:15 INFO mapred.JobClient:  map 100% reduce 10%
>> 10/12/18 14:17:18 INFO mapred.JobClient:  map 100% reduce 11%
>> 10/12/18 14:17:19 INFO mapred.JobClient:  map 100% reduce 13%
>> 10/12/18 14:17:21 INFO mapred.JobClient:  map 100% reduce 21%
>> 10/12/18 14:17:22 INFO mapred.JobClient:  map 100% reduce 23%
>> 10/12/18 14:17:24 INFO mapred.JobClient:  map 100% reduce 28%
>> 10/12/18 14:17:27 INFO mapred.JobClient:  map 100% reduce 33%
>> 10/12/18 14:17:29 INFO mapred.JobClient:  map 100% reduce 35%
>> 10/12/18 14:17:31 INFO mapred.JobClient:  map 100% reduce 38%
>> 10/12/18 14:17:34 INFO mapred.JobClient:  map 100% reduce 40%
>>
>> At this point, these percentages stay blocked for 10 minutes (600 secs).
>> In the meantime I looked at the TaskTracker status pages: on each
>> machine's page I see 3 running reduce tasks with Status=RUNNING and
>> Progress=100%; here is a sample:
>> TaskAttempts                           Status    Progress
>> attempt_201012181359_0004_r_000001_0   RUNNING   100.00%
>> attempt_201012181359_0004_r_000003_0   RUNNING   100.00%
>> attempt_201012181359_0004_r_000006_0   RUNNING   100.00%
>>
>> Then, after 600 secs, the console logging continues with the following:
>> 10/12/18 14:27:34 INFO mapred.JobClient:  map 100% reduce 36%
>> 10/12/18 14:27:36 INFO mapred.JobClient: Task Id :
>> attempt_201012181359_0004_r_000006_0, Status : FAILED Task
>> attempt_201012181359_0004_r_000006_0 failed to report status for 602
>> seconds. Killing!
>> 10/12/18 14:27:38 INFO mapred.JobClient:  map 100% reduce 23%
>> 10/12/18 14:27:40 INFO mapred.JobClient:  map 100% reduce 6%
>> 10/12/18 14:27:40 INFO mapred.JobClient: Task Id :
>> attempt_201012181359_0004_r_000007_0, Status : FAILED
>> Task attempt_201012181359_0004_r_000007_0 failed to report status for
>> 601 seconds. Killing!
>> 10/12/18 14:27:40 INFO mapred.JobClient: Task Id :
>> attempt_201012181359_0004_r_000010_0, Status : FAILED
>> Task attempt_201012181359_0004_r_000010_0 failed to report status for
>> 601 seconds. Killing!
>> 10/12/18 14:27:41 INFO mapred.JobClient:  map 100% reduce 0%
>> 10/12/18 14:27:42 INFO mapred.JobClient: Task Id :
>> attempt_201012181359_0004_r_000000_0, Status : FAILED
>> Task attempt_201012181359_0004_r_000000_0 failed to report status for
>> 600 seconds. Killing!
>> 10/12/18 14:27:42 INFO mapred.JobClient: Task Id :
>> attempt_201012181359_0004_r_000009_0, Status : FAILED
>> Task attempt_201012181359_0004_r_000009_0 failed to report status for
>> 600 seconds. Killing!
>> 10/12/18 14:27:43 INFO mapred.JobClient: Task Id :
>> attempt_201012181359_0004_r_000001_0, Status : FAILED
>> Task attempt_201012181359_0004_r_000001_0 failed to report status for
>> 601 seconds. Killing!
>> 10/12/18 14:27:43 INFO mapred.JobClient: Task Id :
>> attempt_201012181359_0004_r_000003_0, Status : FAILED
>> Task attempt_201012181359_0004_r_000003_0 failed to report status for
>> 602 seconds. Killing!
>> 10/12/18 14:27:45 INFO mapred.JobClient: Task Id :
>> attempt_201012181359_0004_r_000004_0, Status : FAILED
>> Task attempt_201012181359_0004_r_000004_0 failed to report status for
>> 601 seconds. Killing!
>> 10/12/18 14:27:45 INFO mapred.JobClient: Task Id :
>> attempt_201012181359_0004_r_000008_0, Status : FAILED
>> Task attempt_201012181359_0004_r_000008_0 failed to report status for
>> 601 seconds. Killing!
>> 10/12/18 14:27:46 INFO mapred.JobClient:  map 100% reduce 1%
>> 10/12/18 14:27:46 INFO mapred.JobClient: Task Id :
>> attempt_201012181359_0004_r_000011_0, Status : FAILED
>> Task attempt_201012181359_0004_r_000011_0 failed to report status for
>> 601 seconds. Killing!
>> 10/12/18 14:27:46 INFO mapred.JobClient: Task Id :
>> attempt_201012181359_0004_r_000002_0, Status : FAILED
>> Task attempt_201012181359_0004_r_000002_0 failed to report status for
>> 602 seconds. Killing!
>> 10/12/18 14:27:48 INFO mapred.JobClient: Task Id :
>> attempt_201012181359_0004_r_000005_0, Status : FAILED
>> Task attempt_201012181359_0004_r_000005_0 failed to report status for
>> 602 seconds. Killing!
>> 10/12/18 14:27:49 INFO mapred.JobClient:  map 100% reduce 2%
>> 10/12/18 14:27:55 INFO mapred.JobClient:  map 100% reduce 4%
>> 10/12/18 14:27:56 INFO mapred.JobClient:  map 100% reduce 6%
>> 10/12/18 14:27:58 INFO mapred.JobClient:  map 100% reduce 9%
>> 10/12/18 14:28:00 INFO mapred.JobClient:  map 100% reduce 14%
>> 10/12/18 14:28:02 INFO mapred.JobClient:  map 100% reduce 22%
>> 10/12/18 14:28:03 INFO mapred.JobClient:  map 100% reduce 25%
>> 10/12/18 14:28:05 INFO mapred.JobClient:  map 100% reduce 31%
>> 10/12/18 14:28:06 INFO mapred.JobClient:  map 100% reduce 33%
>> 10/12/18 14:28:08 INFO mapred.JobClient:  map 100% reduce 36%
>> 10/12/18 14:28:09 INFO mapred.JobClient:  map 100% reduce 37%
>> 10/12/18 14:28:11 INFO mapred.JobClient:  map 100% reduce 38%
>> 10/12/18 14:28:12 INFO mapred.JobClient:  map 100% reduce 39%
>> 10/12/18 14:28:14 INFO mapred.JobClient:  map 100% reduce 40%
>>
>> and this behaviour keeps repeating.
>>
>> Has anyone had the same problem and can help me?
>> Thank you all!
>>
>> Ivan
>>
>
