hadoop-common-user mailing list archives

From David Batista <dsbati...@xldb.di.fc.ul.pt>
Subject Reduce() time takes ~4x Map()
Date Thu, 28 May 2009 16:41:30 GMT
Hi everyone,

I'm processing XML files, around 500 MB each, containing several documents.
To the map() function I pass one document from the XML file, which
takes some time to process depending on its size - I'm applying NER to each document.

Each document has a unique identifier, so I'm using that identifier as
the key and the result of parsing the document, as a single string, as the value.
So at the end of the map() function:
output.collect( new Text(identifier), new Text(outputString));

Usually outputString is around 1-5 KB in size.
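
To make this concrete, the map() side looks roughly like this (a sketch
against the old org.apache.hadoop.mapred API; extractIdentifier and runNER
are hypothetical stand-ins for my actual parsing and NER code):

import java.io.IOException;
import java.util.Iterator; // used by the reduce() below

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public void map(LongWritable offset, Text document,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    // one input record = one document taken from the 500 MB XML file
    String identifier = extractIdentifier(document.toString()); // hypothetical helper
    String outputString = runNER(document.toString());          // hypothetical NER step
    output.collect(new Text(identifier), new Text(outputString));
}

And the reduce() is just a pass-through: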

public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    // identity reduce: emit each value unchanged under its key
    while (values.hasNext()) {
        output.collect(key, values.next());
    }
}
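
(Side note: since this reduce is a pure pass-through, I believe the stock
identity reducer in the old API would do the same thing:

conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);

where conf is the job's JobConf.)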

I did a test using only 1 machine with 8 cores and only 1 XML file;
it took around 3 hours to process all the maps and ~12 hours for the reduces.

The XML file has 139,945 documents.

I set the JobConf to 1000 map tasks and 200 reduce tasks.
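
Roughly like this (a sketch using the old-API JobConf; MyNerJob is a
hypothetical job class, and the map count is only a hint to the framework):

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyNerJob.class);
conf.setNumMapTasks(1000);   // only a hint; the real map count follows the input splits
conf.setNumReduceTasks(200);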

I took a look at the graphs on the web interface during the reduce
phase, and indeed it's the copy phase that's taking most of the time;
the sort and reduce phases finish almost instantly.

Why does the copy phase take so long? I understand that the copies
are made over HTTP, and the data is in really small chunks of 1-5 KB,
but even so, with everything on the same physical machine, it should
have been faster, no?
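
Back-of-the-envelope, assuming an average value of ~3 KB: 139,945
records x ~3 KB is about 410 MB of total map output, spread over 200
reduces - on the order of 2 MB per reduce. Copying that between
processes on a single machine should take seconds, not hours.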

Any suggestions on what might be causing the copies in reduce to take so long?
