hadoop-common-user mailing list archives

From "Tom White" <tom.e.wh...@gmail.com>
Subject Re: s3
Date Mon, 08 Jan 2007 20:45:08 GMT
> Here's a thought: implement a simple read-only HttpFileSystem that works
> for MapReduce input.  It couldn't perform directory enumeration, so job
> inputs would have to be listed explicitly, not as directories or glob
> patterns.
>
> For raw S3, one could make a subclass that adds directory enumeration,
> since that's possible with S3, but still throws exceptions for renames,
> etc.  (One could also add support for write and delete.)
>
> CopyFiles could then use HTTP URIs directly, so it wouldn't need a
> separate mapper for HTTP inputs and would be further simplified.
> Processing Hadoop log files should also be possible using an HttpFileSystem.
>
> One could even extend HttpFileSystem to work for basic MapReduce output,
> using HTTP PUT to store files, passing a configured authorization.  File
> deletion could be implemented with DELETE.  One could adopt a convention
> that HTTP URIs ending in slashes indicate directories.  Directory
> enumeration could then work by parsing the returned HTML directory
> listing, providing a reasonably complete FileSystem implementation.
>
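To make the read-only part of this concrete, here is a minimal sketch of
what such an HttpFileSystem might look like. It is written against the
current org.apache.hadoop.fs.FileSystem contract rather than the API of
the day, and the class itself, the Range-request seek, and the stubbed
methods are illustrative assumptions, not code in the tree:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FSInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.util.Progressable;

/** Read-only FileSystem over plain HTTP: no directory enumeration, so
 *  job inputs must be listed explicitly, not as directories or globs. */
public class HttpFileSystem extends FileSystem {
  private URI uri;

  public void initialize(URI name, Configuration conf) throws IOException {
    super.initialize(name, conf);
    setConf(conf);
    this.uri = name;
  }

  public URI getUri() { return uri; }

  public FSDataInputStream open(Path f, int bufferSize) throws IOException {
    return new FSDataInputStream(new HttpInputStream(f.toUri().toURL()));
  }

  public FileStatus getFileStatus(Path f) throws IOException {
    // A HEAD request supplies length and modification time.
    HttpURLConnection c = (HttpURLConnection) f.toUri().toURL().openConnection();
    c.setRequestMethod("HEAD");
    if (c.getResponseCode() != HttpURLConnection.HTTP_OK) {
      throw new IOException("HEAD " + f + " returned " + c.getResponseCode());
    }
    return new FileStatus(c.getContentLength(), false, 1, 4096,
                          c.getLastModified(), f);
  }

  // Plain HTTP gives us no directory or mutation semantics.
  private static IOException unsupported() {
    return new IOException("not supported over plain HTTP");
  }
  public FileStatus[] listStatus(Path f) throws IOException { throw unsupported(); }
  public FSDataOutputStream create(Path f, FsPermission permission,
      boolean overwrite, int bufferSize, short replication, long blockSize,
      Progressable progress) throws IOException { throw unsupported(); }
  public FSDataOutputStream append(Path f, int bufferSize,
      Progressable progress) throws IOException { throw unsupported(); }
  public boolean rename(Path src, Path dst) throws IOException { throw unsupported(); }
  public boolean delete(Path f, boolean recursive) throws IOException { throw unsupported(); }
  public boolean mkdirs(Path f, FsPermission permission) throws IOException { throw unsupported(); }
  public Path getWorkingDirectory() { return new Path("/"); }
  public void setWorkingDirectory(Path dir) { }

  /** Seekable stream: seek() reopens the connection with a Range header. */
  private static class HttpInputStream extends FSInputStream {
    private final URL url;
    private InputStream in;
    private long pos;

    HttpInputStream(URL url) throws IOException { this.url = url; openAt(0); }

    private void openAt(long offset) throws IOException {
      HttpURLConnection c = (HttpURLConnection) url.openConnection();
      if (offset > 0) {
        c.setRequestProperty("Range", "bytes=" + offset + "-");
      }
      in = c.getInputStream();
      pos = offset;
    }

    public synchronized int read() throws IOException {
      int b = in.read();
      if (b >= 0) pos++;
      return b;
    }
    public synchronized void seek(long newPos) throws IOException {
      in.close();
      openAt(newPos);
    }
    public synchronized long getPos() { return pos; }
    public boolean seekToNewSource(long targetPos) { return false; }
    public void close() throws IOException { in.close(); }
  }
}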
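And a rough cut at the write support from the last paragraph, building on
the sketch above (same imports): PUT to store on close, DELETE to remove,
with the authorization passed from a made-up fs.http.authorization key.
None of this exists today; it only shows the shape of the idea:

/** Hypothetical write-enabled variant: PUT to store, DELETE to remove.
 *  By convention, URIs ending in '/' would denote directories. */
public class WritableHttpFileSystem extends HttpFileSystem {

  public FSDataOutputStream create(Path f, FsPermission permission,
      boolean overwrite, int bufferSize, short replication, long blockSize,
      Progressable progress) throws IOException {
    final HttpURLConnection c =
        (HttpURLConnection) f.toUri().toURL().openConnection();
    c.setRequestMethod("PUT");
    c.setDoOutput(true);
    // Pass a configured authorization, as suggested; the key is made up.
    String auth = getConf().get("fs.http.authorization");
    if (auth != null) c.setRequestProperty("Authorization", auth);
    return new FSDataOutputStream(c.getOutputStream(), statistics) {
      public void close() throws IOException {
        super.close();  // flushes the body; the PUT completes here
        if (c.getResponseCode() / 100 != 2) {
          throw new IOException("PUT failed: " + c.getResponseCode());
        }
      }
    };
  }

  public boolean delete(Path f, boolean recursive) throws IOException {
    HttpURLConnection c =
        (HttpURLConnection) f.toUri().toURL().openConnection();
    c.setRequestMethod("DELETE");
    return c.getResponseCode() / 100 == 2;
  }
}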

This sounds like a good plan. I wonder whether the existing
block-based s3 scheme should be renamed (to s3block or similar) so that
s3 becomes the scheme that stores raw files as you describe?
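
For what it's worth, the scheme-to-class binding is just configuration
(fs.<scheme>.impl), so the rename is mostly a question of which class
claims "s3". Something like the following, where RawS3FileSystem is a
hypothetical name for the raw-file implementation:

import org.apache.hadoop.conf.Configuration;

public class S3SchemeBinding {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The existing block-based store keeps working under a new scheme...
    conf.set("fs.s3block.impl", "org.apache.hadoop.fs.s3.S3FileSystem");
    // ...while "s3" is bound to a (hypothetical) raw-file implementation.
    conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3.RawS3FileSystem");
  }
}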

Tom
