hadoop-mapreduce-issues mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-2349) speed up list[located]status calls from input formats
Date Tue, 01 Mar 2011 22:49:37 GMT
speed up list[located]status calls from input formats
-----------------------------------------------------

                 Key: MAPREDUCE-2349
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2349
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: task
            Reporter: Joydeep Sen Sarma


When a job has many input paths, the listStatus calls (or the improved listLocatedStatus
calls) invoked from the getSplits() method can take a long time. The calls are issued one at
a time, so most of that time is spent waiting for the previous call to complete before the
next one is dispatched.

This can be sped up considerably by dispatching multiple calls at once (via executors). If the
same filesystem client is used, the calls pipeline much better (since calls are serialized)
and don't impose extra burden on the namenode, while greatly reducing the latency seen by
the client. In a simple test during non-peak hours, getSplits() time dropped from about 3s
to about 0.5s.
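
A minimal sketch of the executor approach described above, not the actual patch. The class
ParallelListing, the method listAll, and the `lister` parameter are hypothetical names
introduced for illustration; `lister` stands in for a call such as listStatus made through
one shared filesystem client, and the thread count is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelListing {

    // Hypothetical helper: lists every input path concurrently and returns
    // one result list per path, in the same order as the input paths.
    // `lister` is an assumption of this sketch; in a real input format it
    // would wrap a listStatus/listLocatedStatus call on a shared client.
    public static <T> List<List<T>> listAll(
            List<String> paths,
            Function<String, List<T>> lister,
            int threads) throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Dispatch all calls up front instead of waiting for each reply
            // before sending the next request.
            List<Future<List<T>>> futures = new ArrayList<>();
            for (String path : paths) {
                Callable<List<T>> task = () -> lister.apply(path);
                futures.add(pool.submit(task));
            }
            // Collect in submission order so the caller sees a stable ordering.
            List<List<T>> results = new ArrayList<>();
            for (Future<List<T>> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting the futures in submission order keeps the results aligned with the input paths, so
the split computation that follows can stay unchanged. A failure in any single listing
surfaces as an ExecutionException from future.get().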
