hadoop-hdfs-issues mailing list archives

From "Paul Burkhardt (JIRA)" <j...@apache.org>
Subject [jira] Moved: (HDFS-1402) Optimize input split creation
Date Wed, 15 Sep 2010 21:53:34 GMT

     [ https://issues.apache.org/jira/browse/HDFS-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Burkhardt moved MAPREDUCE-1973 to HDFS-1402:
-------------------------------------------------

              Project: Hadoop HDFS  (was: Hadoop Map/Reduce)
                  Key: HDFS-1402  (was: MAPREDUCE-1973)
    Affects Version/s: 0.22.0
                           (was: 0.20.1)
                           (was: 0.20.2)

> Optimize input split creation
> -----------------------------
>
>                 Key: HDFS-1402
>                 URL: https://issues.apache.org/jira/browse/HDFS-1402
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 0.22.0
>         Environment: Intel Nehalem cluster running Red Hat.
>            Reporter: Paul Burkhardt
>            Priority: Minor
>         Attachments: HADOOP-1973.patch
>
>
> The input split returns the locations that host the file blocks in the split. The locations
> are determined by the getBlockLocations method of the filesystem client, which requires a remote
> connection to the filesystem (i.e. HDFS). A remote connection is made for each file in the
> entire input split. For jobs with many input files, the network connections dominate the cost
> of writing the input split file.
> A job requests a listing of the input files from the remote filesystem and creates a
> FileStatus object as a handle for each file in the listing. The FileStatus object can be populated
> with the necessary host information on the remote end and passed to the client side in the
> bulk return of the listing request. A getHosts method on FileStatus would then return
> the locations of the blocks comprising that file and eliminate the need for another trip
> to the remote filesystem.
> The INodeFile maintains the blocks for a file and is the obvious source for the locations
> of that file. It is also available to the FSDirectory, which first creates
> the listing of FileStatus objects. We propose that the block locations be generated by the
> INodeFile and used to instantiate the FileStatus object during the getListing request.
> Our tests demonstrated a factor of 2000 speedup for approximately 60,000 input files.
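
The difference between the two approaches described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the attached patch and not the Hadoop API: FakeRemoteFs, FileEntry, hostsPerFile and hostsFromListing are hypothetical names, and a counter stands in for real network connections. Only the getHosts accessor mirrors the method proposed in the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SplitLocationSketch {

    /** Hypothetical listing entry; hosts is null when the listing carried no locations. */
    static final class FileEntry {
        final String path;
        private final String[] hosts;

        FileEntry(String path, String[] hosts) {
            this.path = path;
            this.hosts = hosts;
        }

        /** Mirrors the proposed getHosts accessor: no remote call is needed. */
        String[] getHosts() {
            return hosts;
        }
    }

    /** Hypothetical stand-in for the remote filesystem; counts simulated round trips. */
    static final class FakeRemoteFs {
        private final Map<String, String[]> hostsByPath = new LinkedHashMap<>();
        int roundTrips = 0;

        void addFile(String path, String... hosts) {
            hostsByPath.put(path, hosts);
        }

        /** One round trip for the whole listing, with or without host information. */
        List<FileEntry> list(boolean withHosts) {
            roundTrips++;
            List<FileEntry> entries = new ArrayList<>();
            for (Map.Entry<String, String[]> e : hostsByPath.entrySet()) {
                entries.add(new FileEntry(e.getKey(), withHosts ? e.getValue() : null));
            }
            return entries;
        }

        /** One round trip per file, as in the current behaviour described above. */
        String[] getBlockHosts(String path) {
            roundTrips++;
            return hostsByPath.get(path);
        }
    }

    /** Current approach: list the files, then ask the remote side about each file. */
    static Map<String, String[]> hostsPerFile(FakeRemoteFs fs) {
        Map<String, String[]> result = new LinkedHashMap<>();
        for (FileEntry entry : fs.list(false)) {
            result.put(entry.path, fs.getBlockHosts(entry.path));
        }
        return result;
    }

    /** Proposed approach: the listing already carries the hosts for every file. */
    static Map<String, String[]> hostsFromListing(FakeRemoteFs fs) {
        Map<String, String[]> result = new LinkedHashMap<>();
        for (FileEntry entry : fs.list(true)) {
            result.put(entry.path, entry.getHosts());
        }
        return result;
    }

    public static void main(String[] args) {
        FakeRemoteFs perFile = new FakeRemoteFs();
        FakeRemoteFs bulk = new FakeRemoteFs();
        for (int i = 0; i < 1000; i++) {
            perFile.addFile("/in/part-" + i, "host" + (i % 4));
            bulk.addFile("/in/part-" + i, "host" + (i % 4));
        }
        hostsPerFile(perFile);
        hostsFromListing(bulk);
        System.out.println("per-file round trips: " + perFile.roundTrips);
        System.out.println("bulk round trips:     " + bulk.roundTrips);
    }
}
```

In this sketch the per-file approach costs one round trip for the listing plus one per input file, while the bulk approach costs a single round trip regardless of the number of files. The sketch says nothing about real timings; the speedup figure quoted above comes from the reporter's own tests.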

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

