Return-Path:
Delivered-To: apmail-hadoop-core-dev-archive@www.apache.org
Received: (qmail 49306 invoked from network); 3 Jun 2008 12:19:11 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 3 Jun 2008 12:19:11 -0000
Received: (qmail 40246 invoked by uid 500); 3 Jun 2008 12:19:11 -0000
Delivered-To: apmail-hadoop-core-dev-archive@hadoop.apache.org
Received: (qmail 40203 invoked by uid 500); 3 Jun 2008 12:19:11 -0000
Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: core-dev@hadoop.apache.org
Delivered-To: mailing list core-dev@hadoop.apache.org
Received: (qmail 40128 invoked by uid 99); 3 Jun 2008 12:19:11 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 03 Jun 2008 05:19:11 -0700
X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED
X-Spam-Check-By: apache.org
Received: from [140.211.11.140] (HELO brutus.apache.org) (140.211.11.140) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 03 Jun 2008 12:18:23 +0000
Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id 27C5C234C126 for ; Tue, 3 Jun 2008 05:18:45 -0700 (PDT)
Message-ID: <284064787.1212495525144.JavaMail.jira@brutus>
Date: Tue, 3 Jun 2008 05:18:45 -0700 (PDT)
From: "Tom White (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Updated: (HADOOP-3095) Validating input paths and creating splits is slow on S3
In-Reply-To: <2085212560.1206552690774.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Virus-Checked: Checked by ClamAV on apache.org

     [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-3095:
------------------------------

    Attachment: hadoop-3095-v3.patch

New patch to fix unit test failures. Two failures were unrelated to this patch, but three showed up a problem with Path objects in FileStatus not being fully qualified. The problem arises since FileInputFormat#listStatus doesn't return fully-qualified paths so #getSplits can't pick up the correct FileSystem(s). I've changed DistributedFileSystem to return a fully-qualified Path in getFileStatus(). However, this doesn't fully solve the problem as other FileSystem implementations don't return fully-qualified paths in FileStatus. Should we strengthen the contract of FileSystem to require them to do so?

(The increased warning count is due to the addition of deprecated methods that are still called, mainly from tests.)

> Validating input paths and creating splits is slow on S3
> --------------------------------------------------------
>
>                 Key: HADOOP-3095
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3095
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: fs, fs/s3
>            Reporter: Tom White
>            Assignee: Owen O'Malley
>             Fix For: 0.18.0
>
>         Attachments: faster-job-init.patch, hadoop-3095-v2.patch, hadoop-3095-v3.patch, hadoop-3095.patch
>
>
> A call to listPaths on S3FileSystem results in an S3 access for each file in the directory being queried. If the input contains hundreds or thousands of files this is prohibitively slow. This method is called in FileInputFormat.validateInput and FileInputFormat.getSplits. This would be easy to fix by overriding listPaths (all four variants) in S3FileSystem to not use listStatus which creates a FileStatus object for each subpath. However, listPaths is deprecated in favour of listStatus so this might be OK as a short term measure, but not longer term.
> But it gets worse: FileInputFormat.getSplits goes on to access S3 a further six times for each input file via these calls:
> 1. fs.isDirectory
> 2. fs.exists
> 3. fs.getLength
> 4. fs.getLength
> 5. fs.exists (from fs.getFileBlockLocations)
> 6. fs.getBlockSize
> So it would be best to change getSplits to use listStatus, and only access S3 once for each file. (This would help HDFS too.) This change would require some care since FileInputFormat has a protected method listPaths which subclasses can override (although, in passing I notice validateInput doesn't use listPaths - is this a bug?).
> For input validation, one approach would be to disable it for S3 by creating a custom FileInputFormat. In this case, missing files would be detected during split generation. Alternatively, it may be possible to cache the input paths between validateInput and getSplits.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
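
The access pattern described in the quoted issue can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code in any of the attached patches: CountingStore, Status and both counting methods are hypothetical names, the store is an in-memory map rather than Hadoop's FileSystem, and a listing is assumed to cost one access per entry, as the issue describes for S3. It compares the six per-file calls listed above with reusing the metadata already returned by the listing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: these types are hypothetical and are not Hadoop's API. */
class SplitAccessSketch {

    /** Metadata for one entry, playing the role that FileStatus plays in the issue. */
    static final class Status {
        final String path;
        final long length;
        final long blockSize;
        final boolean directory;

        Status(String path, long length, long blockSize, boolean directory) {
            this.path = path;
            this.length = length;
            this.blockSize = blockSize;
            this.directory = directory;
        }
    }

    /** In-memory store that counts every simulated remote access. */
    static final class CountingStore {
        private final Map<String, Status> files = new LinkedHashMap<>();
        int accesses = 0;

        void add(String path, long length, long blockSize) {
            files.put(path, new Status(path, length, blockSize, false));
        }

        // Each per-path query costs one access.
        boolean exists(String p) { accesses++; return files.containsKey(p); }
        boolean isDirectory(String p) {
            accesses++;
            Status s = files.get(p);
            return s != null && s.directory;
        }
        long getLength(String p) { accesses++; return files.get(p).length; }
        long getBlockSize(String p) { accesses++; return files.get(p).blockSize; }

        // Assumption: listing costs one access per entry, as described for S3.
        List<Status> listStatus() {
            accesses += files.size();
            return new ArrayList<>(files.values());
        }
    }

    /** Listing followed by the six per-file calls enumerated in the issue. */
    static long countSplitsPerFileCalls(CountingStore fs) {
        long splits = 0;
        for (Status listed : fs.listStatus()) {
            String p = listed.path;                             // only the path is kept
            if (fs.isDirectory(p) || !fs.exists(p)) continue;   // calls 1 and 2
            fs.getLength(p);                                    // call 3
            long length = fs.getLength(p);                      // call 4
            fs.exists(p);                                       // call 5 (block locations)
            long block = fs.getBlockSize(p);                    // call 6
            splits += (length + block - 1) / block;             // ceiling division
        }
        return splits;
    }

    /** Same result, computed from the metadata the listing already returned. */
    static long countSplitsFromListing(CountingStore fs) {
        long splits = 0;
        for (Status s : fs.listStatus()) {
            if (s.directory) continue;
            splits += (s.length + s.blockSize - 1) / s.blockSize;
        }
        return splits;
    }

    public static void main(String[] args) {
        CountingStore fs = new CountingStore();
        fs.add("in/a", 100, 64);
        fs.add("in/b", 64, 64);
        fs.add("in/c", 10, 64);

        long perFile = countSplitsPerFileCalls(fs);
        System.out.println("per-file calls: splits=" + perFile + " accesses=" + fs.accesses);

        fs.accesses = 0;
        long fromListing = countSplitsFromListing(fs);
        System.out.println("from listing:   splits=" + fromListing + " accesses=" + fs.accesses);
    }
}
```

In this model the per-file variant costs seven accesses per input file (one for the listing entry plus the six calls), while the listing-based variant costs one, which is the reduction the issue argues for.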