hadoop-common-dev mailing list archives

From "Tom White (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3095) Validating input paths and creating splits is slow on S3
Date Mon, 02 Jun 2008 13:35:44 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom White updated HADOOP-3095:
------------------------------

    Attachment: hadoop-3095.patch

I've extended Owen's patch to:

1. Add an overloaded version of FileSystem#getFileBlockLocations() that takes a FileStatus object,
so that implementations of FileSystem have a way of avoiding an exists call per file.
2. Deprecate validateInput.
3. For backward compatibility, call both listPaths and listStatus. If the paths are the same, use
the FileStatus objects; otherwise fall back to the existing behaviour (use the Path objects)
and issue a warning. When we remove listPaths in a subsequent release we can remove this check.
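
As a rough illustration of item 1, the sketch below shows why an overload that accepts an already-fetched status object saves a lookup. SketchStatus and SketchFileSystem are hypothetical stand-ins written for this example, not the Hadoop classes, and the "localhost" host name and the remoteCalls counter are assumptions made only to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a FileStatus: metadata fetched once and then reused.
final class SketchStatus {
    final String path;
    final long length;

    SketchStatus(String path, long length) {
        this.path = path;
        this.length = length;
    }
}

// Hypothetical stand-in for a FileSystem backed by a remote store such as S3.
class SketchFileSystem {
    private final Map<String, Long> store = new HashMap<>();
    int remoteCalls = 0; // counts simulated round trips to the store

    void put(String path, long length) {
        store.put(path, length);
    }

    // One simulated round trip; returns null if the path does not exist.
    SketchStatus getFileStatus(String path) {
        remoteCalls++;
        Long length = store.get(path);
        return length == null ? null : new SketchStatus(path, length);
    }

    // Path-based variant: it must look the file up before it can answer.
    String[] getFileBlockLocations(String path, long start, long len) {
        SketchStatus status = getFileStatus(path); // the extra remote call
        return status == null ? null : getFileBlockLocations(status, start, len);
    }

    // Status-based overload: the caller already holds the metadata, so no lookup.
    String[] getFileBlockLocations(SketchStatus status, long start, long len) {
        if (status == null || start < 0 || len < 0) {
            throw new IllegalArgumentException("invalid status, start or len");
        }
        if (start >= status.length) {
            return new String[0]; // requested range lies past the end of the file
        }
        return new String[] { "localhost" }; // placeholder host for the sketch
    }
}
```

A caller that has already listed the directory can pass each status straight to the second overload, so the number of remote calls does not grow with the number of block-location queries.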

I couldn't see how to make the ThreadLocal approach suggested by Doug work when a subclass
implementation of listPaths calls its superclass (a common case, since the subclass
typically modifies the list returned by the superclass). Calling both listPaths and listStatus
does add some overhead, but with validateInput gone the patch shouldn't make things worse
overall.
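
The compatibility check from item 3 might look roughly like the following. This is a minimal sketch under stated assumptions: InputEntry, SketchInputFormat and FilteringInputFormat are hypothetical stand-ins rather than Hadoop classes, the ".tmp" filter is an invented example of a subclass modifying its superclass's list, and the warning is collected in a list instead of being logged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a FileStatus: a path plus the metadata fetched with it.
final class InputEntry {
    final String path;
    final long length;

    InputEntry(String path, long length) {
        this.path = path;
        this.length = length;
    }
}

// Hypothetical stand-in for FileInputFormat; only the listing logic is modelled.
class SketchInputFormat {
    private final List<InputEntry> entries;
    final List<String> warnings = new ArrayList<>();

    SketchInputFormat(List<InputEntry> entries) {
        this.entries = entries;
    }

    // New-style listing: returns paths together with their metadata.
    protected List<InputEntry> listStatus() {
        return new ArrayList<>(entries);
    }

    // Old-style listing: returns bare paths. By default it mirrors listStatus.
    protected List<String> listPaths() {
        List<String> paths = new ArrayList<>();
        for (InputEntry e : listStatus()) {
            paths.add(e.path);
        }
        return paths;
    }

    // True if the status objects can be used; false means fall back to the paths.
    boolean canUseStatuses() {
        List<String> fromStatuses = new ArrayList<>();
        for (InputEntry e : listStatus()) {
            fromStatuses.add(e.path);
        }
        if (listPaths().equals(fromStatuses)) {
            return true;
        }
        warnings.add("listPaths() disagrees with listStatus(); falling back to paths");
        return false;
    }
}

// An old-style subclass: it filters the list returned by its superclass.
class FilteringInputFormat extends SketchInputFormat {
    FilteringInputFormat(List<InputEntry> entries) {
        super(entries);
    }

    @Override
    protected List<String> listPaths() {
        List<String> kept = new ArrayList<>();
        for (String p : super.listPaths()) {
            if (!p.endsWith(".tmp")) {
                kept.add(p);
            }
        }
        return kept;
    }
}
```

In this sketch an unmodified format takes the fast path, while a subclass that still customises listPaths is detected by the comparison and keeps its old behaviour, at the cost of one warning.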

> Validating input paths and creating splits is slow on S3
> --------------------------------------------------------
>
>                 Key: HADOOP-3095
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3095
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: fs, fs/s3
>            Reporter: Tom White
>            Assignee: Owen O'Malley
>             Fix For: 0.18.0
>
>         Attachments: faster-job-init.patch, hadoop-3095.patch
>
>
> A call to listPaths on S3FileSystem results in an S3 access for each file in the directory
> being queried. If the input contains hundreds or thousands of files this is prohibitively
> slow. This method is called in FileInputFormat.validateInput and FileInputFormat.getSplits.
> This would be easy to fix by overriding listPaths (all four variants) in S3FileSystem to not
> use listStatus which creates a FileStatus object for each subpath. However, listPaths is deprecated
> in favour of listStatus so this might be OK as a short term measure, but not longer term.
> But it gets worse: FileInputFormat.getSplits goes on to access S3 a further six times
> for each input file via these calls:
> 1. fs.isDirectory
> 2. fs.exists
> 3. fs.getLength
> 4. fs.getLength
> 5. fs.exists (from fs.getFileBlockLocations)
> 6. fs.getBlockSize
> So it would be best to change getSplits to use listStatus, and only access S3 once for
> each file. (This would help HDFS too.) This change would require some care since FileInputFormat
> has a protected method listPaths which subclasses can override (although, in passing I notice
> validateInput doesn't use listPaths - is this a bug?).
> For input validation, one approach would be to disable it for S3 by creating a custom
> FileInputFormat. In this case, missing files would be detected during split generation. Alternatively,
> it may be possible to cache the input paths between validateInput and getSplits.
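
To make the access counts in the description concrete, here is a small sketch that compares the six per-file calls with a single listStatus pass. CountingStore, FileMeta and SplitCounter are hypothetical names invented for this example, every metadata call is assumed to cost exactly one remote access, and the split count is simplified to the file length divided by the block size, rounded up, which is not Hadoop's actual split logic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical metadata record, standing in for a FileStatus.
final class FileMeta {
    final String path;
    final boolean isDir;
    final long length;
    final long blockSize;

    FileMeta(String path, boolean isDir, long length, long blockSize) {
        this.path = path;
        this.isDir = isDir;
        this.length = length;
        this.blockSize = blockSize;
    }
}

// Hypothetical store in which every metadata call costs one remote access.
class CountingStore {
    private final Map<String, FileMeta> files = new LinkedHashMap<>();
    int accesses = 0;

    void add(String path, boolean isDir, long length, long blockSize) {
        files.put(path, new FileMeta(path, isDir, length, blockSize));
    }

    private FileMeta fetch(String path) {
        accesses++;
        return files.get(path);
    }

    boolean exists(String path) { return fetch(path) != null; }
    boolean isDirectory(String path) { FileMeta m = fetch(path); return m != null && m.isDir; }
    long getLength(String path) { return fetch(path).length; }
    long getBlockSize(String path) { return fetch(path).blockSize; }

    // Listing is modelled as one access per file, as the description suggests.
    List<FileMeta> listStatus() {
        accesses += files.size();
        return new ArrayList<>(files.values());
    }
}

class SplitCounter {
    // Path-based flow: six accesses per regular file, in the order listed above.
    static long viaPaths(CountingStore store, List<String> paths) {
        long splits = 0;
        for (String p : paths) {
            if (store.isDirectory(p)) continue;      // 1. isDirectory
            if (!store.exists(p)) continue;          // 2. exists
            long length = store.getLength(p);        // 3. getLength
            store.getLength(p);                      // 4. getLength (repeated lookup)
            store.exists(p);                         // 5. exists, as done for block locations
            long blockSize = store.getBlockSize(p);  // 6. getBlockSize
            splits += (length + blockSize - 1) / blockSize;
        }
        return splits;
    }

    // Status-based flow: everything is computed from the listing result.
    static long viaStatuses(CountingStore store) {
        long splits = 0;
        for (FileMeta m : store.listStatus()) {
            if (m.isDir) continue;
            splits += (m.length + m.blockSize - 1) / m.blockSize;
        }
        return splits;
    }
}
```

Under these assumptions both flows produce the same number of splits, but the path-based flow costs six accesses per file while the status-based flow costs one.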

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

