hadoop-mapreduce-dev mailing list archives

From "Paul Burkhardt (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-1973) Optimize input split creation
Date Tue, 27 Jul 2010 23:15:16 GMT
Optimize input split creation
-----------------------------

                 Key: MAPREDUCE-1973
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1973
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
    Affects Versions: 0.20.2, 0.20.1
         Environment: Intel Nehalem cluster running Red Hat.
            Reporter: Paul Burkhardt
            Priority: Minor


Each input split records the locations (hosts) that store the file blocks in the split. The
locations are obtained from the getBlockLocations method of the filesystem client, which
requires a remote call to the filesystem (e.g. HDFS). One such call is made for every input
file of the job. For jobs with many input files, these network round trips dominate the cost
of writing the input split file.
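
The access pattern described above can be illustrated with a minimal sketch. It does not use
the Hadoop API: CountingFileSystem, PerFileSplitPlanner and their methods are hypothetical
stand-ins that only count simulated remote calls, assuming one call for the listing plus one
location lookup per file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a remote filesystem; not the Hadoop API.
class CountingFileSystem {
    private final Map<String, String[]> hostsByPath = new LinkedHashMap<>();
    int remoteCalls = 0; // number of simulated round trips

    void addFile(String path, String... hosts) {
        hostsByPath.put(path, hosts);
    }

    // One simulated remote call returns the whole listing.
    List<String> listPaths() {
        remoteCalls++;
        return new ArrayList<>(hostsByPath.keySet());
    }

    // One simulated remote call per file, mirroring the per-file lookup.
    String[] getBlockHosts(String path) {
        remoteCalls++;
        return hostsByPath.get(path);
    }
}

class PerFileSplitPlanner {
    // Builds path -> hosts by asking the filesystem once for every file.
    static Map<String, String[]> plan(CountingFileSystem fs) {
        Map<String, String[]> splits = new LinkedHashMap<>();
        for (String path : fs.listPaths()) {
            splits.put(path, fs.getBlockHosts(path));
        }
        return splits;
    }
}
```

With N input files this sketch issues N + 1 simulated remote calls, so the call count grows
linearly with the number of files.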

A job requests a listing of the input files from the remote filesystem and creates a FileStatus
object as a handle for each file in the listing. The FileStatus object could be populated with
the necessary host information on the remote end and passed to the client in the bulk return
of the listing request. A getHosts method on FileStatus would then return the locations of
the blocks comprising that file, eliminating the additional trip to the remote filesystem.
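
A rough sketch of the client-side view of this proposal follows. HostAwareStatus and
BulkListingFileSystem are hypothetical names used only for illustration and are not Hadoop
classes; the sketch assumes the hosts travel with each listing entry, so a single simulated
remote call is enough.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical file handle that carries its block hosts; models the
// proposed getHosts method rather than any existing Hadoop class.
class HostAwareStatus {
    private final String path;
    private final String[] hosts;

    HostAwareStatus(String path, String[] hosts) {
        this.path = path;
        this.hosts = hosts.clone();
    }

    String getPath() {
        return path;
    }

    // Answered locally from data shipped with the listing.
    String[] getHosts() {
        return hosts.clone();
    }
}

// Hypothetical stand-in for the remote filesystem.
class BulkListingFileSystem {
    private final List<HostAwareStatus> files = new ArrayList<>();
    int remoteCalls = 0; // number of simulated round trips

    void addFile(String path, String... hosts) {
        files.add(new HostAwareStatus(path, hosts));
    }

    // One simulated remote call returns every file together with its hosts.
    List<HostAwareStatus> listStatus() {
        remoteCalls++;
        return new ArrayList<>(files);
    }
}
```

In this sketch, calling getHosts on each returned entry adds no further remote calls, whatever
the number of files.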

The INodeFile maintains the blocks for a file and is an obvious choice to be the originator
of the locations for that file. It is also available to the FSDirectory, which first creates
the listing of FileStatus objects. We propose that the INodeFile generate the block locations
and that they be used to populate each FileStatus object during the getListing request.
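
The server-side idea might look roughly like the following sketch. All Sketch* classes are
hypothetical simplifications, not the actual INodeFile or FSDirectory code, and the choice to
return the distinct hosts of a file in first-seen order is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical block record: an id plus the datanodes holding replicas.
class SketchBlock {
    final long id;
    final String[] datanodes;

    SketchBlock(long id, String... datanodes) {
        this.id = id;
        this.datanodes = datanodes;
    }
}

// Hypothetical, simplified file inode that owns its blocks.
class SketchINodeFile {
    final String name;
    final List<SketchBlock> blocks = new ArrayList<>();

    SketchINodeFile(String name) {
        this.name = name;
    }

    // Distinct hosts over all blocks, in first-seen order (an assumption).
    String[] collectHosts() {
        Set<String> hosts = new LinkedHashSet<>();
        for (SketchBlock block : blocks) {
            for (String datanode : block.datanodes) {
                hosts.add(datanode);
            }
        }
        return hosts.toArray(new String[0]);
    }
}

// Hypothetical listing entry returned to the client.
class SketchListingEntry {
    final String name;
    final String[] hosts;

    SketchListingEntry(String name, String[] hosts) {
        this.name = name;
        this.hosts = hosts;
    }
}

// Hypothetical directory that fills in hosts while building the listing.
class SketchDirectory {
    final List<SketchINodeFile> children = new ArrayList<>();

    List<SketchListingEntry> getListing() {
        List<SketchListingEntry> listing = new ArrayList<>();
        for (SketchINodeFile file : children) {
            listing.add(new SketchListingEntry(file.name, file.collectHosts()));
        }
        return listing;
    }
}
```

Because the inode already holds the block list, the hosts are gathered in the same pass that
builds the listing.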

In our tests, this change sped up input split creation by a factor of 2000 for approximately
60,000 input files.
