crunch-dev mailing list archives

From "Jeffrey Quinn (JIRA)" <>
Subject [jira] [Created] (CRUNCH-580) FileTargetImpl#handleOutputs Inefficiency on S3NativeFileSystem
Date Sat, 05 Dec 2015 01:21:11 GMT
Jeffrey Quinn created CRUNCH-580:

             Summary: FileTargetImpl#handleOutputs Inefficiency on S3NativeFileSystem
                 Key: CRUNCH-580
             Project: Crunch
          Issue Type: Bug
          Components: Core, IO
    Affects Versions: 0.13.0
         Environment: Amazon Elastic MapReduce
            Reporter: Jeffrey Quinn
            Assignee: Josh Wills

We have run into a pretty frustrating inefficiency inside of FileTargetImpl#handleOutputs.

This method loops over all of the partial output files and moves each one to its ultimate
destination directory by calling
org.apache.hadoop.fs.FileSystem#rename(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path).

This is no problem when the org.apache.hadoop.fs.FileSystem in question is HDFS, where #rename
is a cheap operation. With an implementation such as S3NativeFileSystem, however, it is
extremely inefficient: each iteration through the loop makes a single blocking S3 API call,
and the loop can be extremely long when there are many thousands of partial output files.
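
To make the cost pattern concrete, here is a minimal sketch of the sequential loop described above, next to one possible way to overlap the blocking calls with a thread pool. It is an illustration only, not the actual Crunch code and not a proposed patch. The Renamer interface, the class name RenameSketch, and the String[] {source, destination} pairs are hypothetical stand-ins for FileSystem#rename and Path, chosen so the example compiles without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class RenameSketch {

    /** Hypothetical stand-in for FileSystem#rename(Path, Path); not a Hadoop or Crunch type. */
    interface Renamer {
        boolean rename(String src, String dst) throws Exception;
    }

    /** The pattern described above: one blocking rename per partial output, one after another. */
    static int renameSequentially(Renamer fs, List<String[]> moves) throws Exception {
        int renamed = 0;
        for (String[] move : moves) {
            if (fs.rename(move[0], move[1])) {
                renamed++;
            }
        }
        return renamed;
    }

    /** A possible alternative: submit every rename to a fixed-size pool and wait for all of them. */
    static int renameConcurrently(final Renamer fs, List<String[]> moves, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (final String[] move : moves) {
                tasks.add(() -> fs.rename(move[0], move[1]));
            }
            int renamed = 0;
            // invokeAll blocks until every task has completed or failed.
            for (Future<Boolean> result : pool.invokeAll(tasks)) {
                if (result.get()) {
                    renamed++;
                }
            }
            return renamed;
        } finally {
            pool.shutdown();
        }
    }
}
```

With a rename that blocks on a remote call, the sequential version takes roughly the sum of all call latencies, while the pooled version is bounded by how many calls the pool and the remote service can handle at once. Whether concurrent renames are safe and worthwhile for a given FileSystem implementation is an assumption that would need to be checked.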

This message was sent by Atlassian JIRA
