hadoop-common-dev mailing list archives

From Esteban Molina-Estolano <eesto...@soe.ucsc.edu>
Subject Adding new filesystem to Hadoop causing too many Map tasks
Date Fri, 01 Jun 2007 08:14:50 GMT

I'm adding support in Hadoop for Ceph (http://ceph.sourceforge.net/),
a distributed filesystem developed at UC Santa Cruz
(http://ssrc.cse.ucsc.edu/). Ceph runs entirely in userspace and is
written in C++. My current implementation is a subclass of FileSystem
that uses a bit of JNI glue to invoke the C++ Ceph client code.

I'm having trouble with a small test: RandomWriter, 4 TaskTracker  
nodes, 5 maps per node, 10 MB per map, for a total of 200 MB over 20  
Map tasks. I tried it on Hadoop with DFS, and it took about 30  
seconds. Then, I ran the same test using Ceph. I changed  
fs.default.name to "ceph:///"; added fs.ceph.impl as  
org.apache.hadoop.fs.ceph.CephFileSystem; and left all other  
configuration settings untouched. It ran horrifically slowly.
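
For reference, the two settings described above would look roughly like
the fragment below. This is only a sketch: the property names and values
are the ones given in this message, but the file location
(conf/hadoop-site.xml) is an assumption, since the message does not say
where the settings were changed.

```xml
<!-- conf/hadoop-site.xml (assumed location; the message does not name the file) -->
<configuration>
  <!-- Make Ceph the default filesystem instead of DFS -->
  <property>
    <name>fs.default.name</name>
    <value>ceph:///</value>
  </property>
  <!-- Map the ceph:// URI scheme to the FileSystem subclass -->
  <property>
    <name>fs.ceph.impl</name>
    <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
  </property>
</configuration>
```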

I ran the JobTracker and each TaskTracker in a separate terminal to  
watch the output. One of the TaskTracker nodes gave me this:
07/06/01 00:16:49 INFO mapred.TaskRunner: task_0001_r_000000_0 Need 400 map output(s)
07/06/01 00:16:49 INFO mapred.TaskRunner: task_0001_r_000000_0 Need 400 map output location(s)

Then the JobTracker spawned 400 Map tasks:
07/06/01 00:23:11 INFO mapred.JobTracker: Adding task 'task_0001_m_000397_0' to tip tip_0001_m_000397, for tracker 'tracker_issdm-11.cse.ucsc.edu:50050'
07/06/01 00:23:12 INFO mapred.JobInProgress: Task 'task_0001_m_000396_0' has completed tip_0001_m_000396 successfully.
07/06/01 00:23:12 INFO mapred.TaskInProgress: Task 'task_0001_m_000396_0' has completed.
07/06/01 00:23:12 INFO mapred.JobInProgress: Choosing normal task tip_0001_m_000398
07/06/01 00:23:12 INFO mapred.JobTracker: Adding task 'task_0001_m_000398_0' to tip tip_0001_m_000398, for tracker 'tracker_issdm-8.cse.ucsc.edu:50050'
07/06/01 00:23:13 INFO mapred.JobInProgress: Task 'task_0001_m_000397_0' has completed tip_0001_m_000397 successfully.
07/06/01 00:23:13 INFO mapred.TaskInProgress: Task 'task_0001_m_000397_0' has completed.
07/06/01 00:23:13 INFO mapred.JobInProgress: Choosing normal task tip_0001_m_000399
07/06/01 00:23:13 INFO mapred.JobTracker: Adding task 'task_0001_m_000399_0' to tip tip_0001_m_000399, for tracker 'tracker_issdm-11.cse.ucsc.edu:50050'

I'm ending up with 400 Map tasks instead of the 20 I expected, and as
a result the job takes far too long to run.

I strongly suspect this is a problem with my implementation, but I'm  
not sure where to start looking. What sort of problem on the  
FileSystem side could cause MapReduce to spawn so many extra tasks?  
How can I pin down the cause?

Thanks,
     ~ Esteban
