hadoop-common-user mailing list archives

From: Kyle Sampson <k...@threerings.net>
Subject: FileSystem.listStatus() on S3
Date: Thu, 29 May 2008 18:25:34 GMT

We're using Hadoop 0.17 with S3 as the filesystem, and we've created a
custom InputFormat for our data.  In InputFormat.getSplits(), it needs to
list all of the files and directories under a certain path, which may
contain thousands of entries.  It uses FileSystem.listStatus() to get
these paths.  With S3, this is turning out to be extraordinarily slow for
directories that contain on the order of thousands of subdirectories and
files.

Looking into it a bit, it seems listStatus() is making a call to S3  
for every subdirectory or file found to get extra file status  
information.  It seems there used to be a listPaths() method that  
would just get the paths, but that's been deprecated and removed.  Is  
there any way currently to get just a list of paths without status  
information?
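
To make the suspected cost concrete, here is a minimal toy model of the
request pattern described above.  It is not Hadoop's S3 code: FakeStore,
listKeys() and fetchLength() are hypothetical names invented for this
sketch, and it assumes one round trip per listing (ignoring pagination)
plus one round trip per entry for status.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: counts simulated round trips, does not talk to S3 or Hadoop.
class ListingCostSketch {

    /** Hypothetical stand-in for a remote object store. */
    static class FakeStore {
        private final List<String> keys = new ArrayList<>();
        int requests = 0; // number of simulated remote calls so far

        FakeStore(int entries) {
            for (int i = 0; i < entries; i++) {
                keys.add("dir/entry-" + i);
            }
        }

        // Assumption: a single request returns every key under the path.
        List<String> listKeys() {
            requests++;
            return new ArrayList<>(keys);
        }

        // Assumption: fetching per-entry metadata costs one request per key.
        long fetchLength(String key) {
            requests++;
            return key.length(); // placeholder value, not a real file size
        }
    }

    /** Requests needed when only the paths are wanted. */
    static int requestsForPathsOnly(int entries) {
        FakeStore store = new FakeStore(entries);
        store.listKeys();
        return store.requests;
    }

    /** Requests needed when status is fetched for every listed entry. */
    static int requestsForPathsWithStatus(int entries) {
        FakeStore store = new FakeStore(entries);
        for (String key : store.listKeys()) {
            store.fetchLength(key);
        }
        return store.requests;
    }

    public static void main(String[] args) {
        int entries = 1000;
        System.out.println("paths only:  " + requestsForPathsOnly(entries) + " request(s)");
        System.out.println("with status: " + requestsForPathsWithStatus(entries) + " request(s)");
    }
}
```

Under these assumptions the status-fetching variant grows linearly with
the number of entries, which would match the slowdown we see if
listStatus() really does behave this way.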

Kyle Sampson
kyle@threerings.net



