crunch-user mailing list archives

From Josh Wills
Subject Re: CrunchJobHooks.CompletionHook Inefficiency on S3NativeFileSystem
Date Tue, 24 Nov 2015 01:45:18 GMT
(I don't know the answer to this, but as I also now run Crunch on top of
S3, I'm interested in a solution.)

On Mon, Nov 23, 2015 at 5:22 PM, Jeff Quinn wrote:

> Hey All,
> We have run into a pretty frustrating inefficiency inside of
> the CrunchJobHooks.CompletionHook#handleMultiPaths.
> This method loops over all of the partial output files and moves them to
> their ultimate destination directories,
> calling org.apache.hadoop.fs.FileSystem#rename(org.apache.hadoop.fs.Path,
> org.apache.hadoop.fs.Path) on each partial output in a loop.
> This is no problem when the org.apache.hadoop.fs.FileSystem in question is
> HDFS where #rename is a cheap operation, but when an implementation such
> as S3NativeFileSystem is used it is extremely inefficient, as each
> iteration through the loop makes a single blocking S3 API call, and this
> loop can be extremely long when there are many thousands of partial output
> files.
> Has anyone dealt with this before, or does anyone have ideas for a workaround?
> Thanks!
> Jeff
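One possible direction, offered only as a sketch: the renames in the loop described above are independent of one another, so they could in principle be issued concurrently from a bounded thread pool instead of one at a time. The class below is hypothetical and is not part of Crunch or Hadoop. To stay runnable without Hadoop on the classpath it uses java.nio.file.Files.move as a local stand-in for FileSystem#rename, and it assumes every destination path has a parent directory. Whether this actually helps on S3 would depend on the file system implementation and on request rate limits.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper (not a Crunch API): runs independent moves concurrently. */
public class ParallelRename {

    /**
     * Moves every source path to its destination using a fixed-size thread pool.
     * Returns the number of completed moves; throws IOException on the first failure seen.
     */
    public static int renameAll(Map<Path, Path> moves, int threads)
            throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Map.Entry<Path, Path> move : moves.entrySet()) {
                // Stand-in for FileSystem#rename(src, dst): each blocking call
                // occupies one worker thread instead of stalling the whole loop.
                pending.add(pool.submit(() -> {
                    Files.createDirectories(move.getValue().getParent());
                    return Files.move(move.getKey(), move.getValue());
                }));
            }
            int done = 0;
            for (Future<?> f : pending) {
                try {
                    f.get(); // wait for this move and surface any failure
                    done++;
                } catch (ExecutionException e) {
                    throw new IOException("rename failed", e.getCause());
                }
            }
            return done;
        } finally {
            pool.shutdown();
        }
    }
}
```

With a pool of N threads, up to N blocking rename calls are in flight at once, so the wall-clock time for many thousands of partial outputs would drop roughly in proportion, as long as the backing store tolerates the extra concurrency.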
