From: "Paul Burkhardt (JIRA)"
To: mapreduce-dev@hadoop.apache.org
Date: Tue, 27 Jul 2010 19:15:16 -0400 (EDT)
Subject: [jira] Created: (MAPREDUCE-1973) Optimize input split creation
Message-ID: <3753337.35131280272516295.JavaMail.jira@thor>

Optimize input split creation
-----------------------------

                 Key: MAPREDUCE-1973
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1973
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
    Affects Versions: 0.20.2, 0.20.1
         Environment: Intel Nehalem cluster running Red Hat.
            Reporter: Paul Burkhardt
            Priority: Minor

The input split returns the locations that host the file blocks in the split. The locations are determined by the getBlockLocations method of the filesystem client, which requires a remote connection to the filesystem (i.e., HDFS). One remote connection is made for each file in the input split. For jobs with many input files, these network connections dominate the cost of writing the input split file.

A job requests a listing of the input files from the remote filesystem and creates a FileStatus object as a handle for each file in the listing. The FileStatus object can be populated with the necessary host information on the remote end and passed to the client side in the bulk return of the listing request. A getHosts method on FileStatus would then return the locations of the blocks comprising that file and eliminate the need for another trip to the remote filesystem.

The INodeFile maintains the blocks for a file and is an obvious choice to be the originator of the locations for that file. It is also available to the FSDirectory, which first creates the listing of FileStatus objects. We propose that the block locations be generated by the INodeFile and used to instantiate the FileStatus object during the getListing request.

Our tests demonstrated a 2000-fold speedup for approximately 60,000 input files.
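
To make the difference in round trips concrete, here is a minimal toy model of the proposal in Java. It is only a sketch and does not use the Hadoop classes. ToyFileStatus, ToyRemoteFs, SplitListingSketch and all of their methods are hypothetical names invented for illustration; getHosts simply mirrors the method name suggested above. The model counts simulated remote calls and does not measure time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a file handle; this is NOT Hadoop's FileStatus. */
final class ToyFileStatus {
    private final String path;
    private final List<String> hosts; // null when the listing carried no locations

    ToyFileStatus(String path, List<String> hosts) {
        this.path = path;
        this.hosts = hosts;
    }

    String getPath() { return path; }

    /** Hosts shipped with the listing, in the spirit of the proposed getHosts. */
    List<String> getHosts() { return hosts; }
}

/** Hypothetical stand-in for the remote filesystem; counts simulated round trips. */
final class ToyRemoteFs {
    private final Map<String, List<String>> blockHosts = new LinkedHashMap<>();
    int roundTrips = 0;

    void addFile(String path, List<String> hosts) { blockHosts.put(path, hosts); }

    /** Current behaviour: one call returns handles without any locations. */
    List<ToyFileStatus> getListing() {
        roundTrips++;
        List<ToyFileStatus> out = new ArrayList<>();
        for (String p : blockHosts.keySet()) out.add(new ToyFileStatus(p, null));
        return out;
    }

    /** Proposed behaviour: the same single call also fills in the hosts remotely. */
    List<ToyFileStatus> getListingWithHosts() {
        roundTrips++;
        List<ToyFileStatus> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : blockHosts.entrySet()) {
            out.add(new ToyFileStatus(e.getKey(), e.getValue()));
        }
        return out;
    }

    /** One simulated remote call per file, as in the per-file lookup. */
    List<String> getBlockLocations(String path) {
        roundTrips++;
        return blockHosts.get(path);
    }
}

final class SplitListingSketch {
    /** Listing plus one location lookup per file. */
    static int tripsPerFileLookup(ToyRemoteFs fs) {
        int before = fs.roundTrips;
        for (ToyFileStatus st : fs.getListing()) {
            fs.getBlockLocations(st.getPath());
        }
        return fs.roundTrips - before;
    }

    /** Listing that already carries the hosts; no further remote calls. */
    static int tripsBulkListing(ToyRemoteFs fs) {
        int before = fs.roundTrips;
        for (ToyFileStatus st : fs.getListingWithHosts()) {
            st.getHosts(); // local read only
        }
        return fs.roundTrips - before;
    }

    /** Builds a toy filesystem with n files, each on two made-up hosts. */
    static ToyRemoteFs sampleFs(int n) {
        ToyRemoteFs fs = new ToyRemoteFs();
        for (int i = 0; i < n; i++) {
            fs.addFile("/input/part-" + i, Arrays.asList("host" + (i % 4), "host" + ((i + 1) % 4)));
        }
        return fs;
    }

    public static void main(String[] args) {
        ToyRemoteFs fs = sampleFs(1000);
        System.out.println("per-file lookups: " + tripsPerFileLookup(fs) + " round trips");
        System.out.println("bulk listing:     " + tripsBulkListing(fs) + " round trips");
    }
}
```

For n files the per-file approach in this model makes n + 1 simulated calls, while the bulk listing makes one. The model says nothing about the cost of each call, so it illustrates where the saving comes from but does not reproduce the speedup reported above.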