crunch-user mailing list archives

From Josh Wills <josh.wi...@gmail.com>
Subject Re: CrunchJobHooks.CompletionHook Inefficiency on S3NativeFileSystem
Date Tue, 24 Nov 2015 03:47:45 GMT
No, just moving to Slack from Cloudera; my data team is all of two people*
right now, and a dedicated Hadoop ops person doesn't make sense yet.

* But of course, I'm hiring. :)
On Mon, Nov 23, 2015 at 6:43 PM Everett Anderson <everett@nuna.com> wrote:

> Josh, not to steal the thread, but I'm quite curious -- did something
> drive you to using S3 instead of HDFS?
>
> For me, I've been surprised how brittle HDFS seems out of the box in the
> face of even mild load. :( We've spent a lot of time turning knobs to make
> our data nodes stay responsive.
>
>
> On Mon, Nov 23, 2015 at 5:45 PM, Josh Wills <josh.wills@gmail.com> wrote:
>
>> (I don't know the answer to this, but as I also now run Crunch on top of
>> S3, I'm interested in a solution.)
>>
>> On Mon, Nov 23, 2015 at 5:22 PM, Jeff Quinn <jeff@nuna.com> wrote:
>>
>>> Hey All,
>>>
>>> We have run into a pretty frustrating inefficiency inside of
>>> CrunchJobHooks.CompletionHook#handleMultiPaths.
>>>
>>> This method loops over all of the partial output files and moves them to
>>> their ultimate destination directories,
>>> calling org.apache.hadoop.fs.FileSystem#rename(org.apache.hadoop.fs.Path,
>>> org.apache.hadoop.fs.Path) on each partial output in a loop.
>>>
>>> This is no problem when the org.apache.hadoop.fs.FileSystem in question
>>> is HDFS where #rename is a cheap operation, but when an implementation such
>>> as S3NativeFileSystem is used it is extremely inefficient, as each
>>> iteration through the loop makes a single blocking S3 API call, and this
>>> loop can be extremely long when there are many thousands of partial output
>>> files.
>>>
>>> Has anyone dealt with this before, or have any ideas for a workaround?
>>>
>>> Thanks!
>>>
>>> Jeff
>>>
>>>
>>>
>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>> may contain information that is confidential, proprietary in nature,
>>> protected health information (PHI), or otherwise protected by law from
>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>> are not the intended recipient, you are hereby notified that any use,
>>> disclosure or copying of this email, including any attachments, is
>>> unauthorized and strictly prohibited. If you have received this email in
>>> error, please notify the sender of this email. Please delete this and all
>>> copies of this email from your system. Any opinions either expressed or
>>> implied in this email and all attachments, are those of its author only,
>>> and do not necessarily reflect those of Nuna Health, Inc.
>>
>>
>>
>

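One possible direction for the rename loop Jeff describes, offered only as a rough sketch and not as a patch to Crunch: since each rename blocks on S3, the calls could be handed to a thread pool so that they overlap instead of running one after another. The example below uses only the Java standard library. The class name ParallelRename, the renameAll method, and the BiPredicate parameter are hypothetical names for illustration; the BiPredicate stands in for FileSystem#rename(Path, Path), which returns true on success. Whether handleMultiPaths can safely be changed this way has not been verified.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiPredicate;

public class ParallelRename {

    /**
     * Runs every rename on a fixed-size thread pool and returns the sources
     * whose rename reported failure. The rename argument is a hypothetical
     * stand-in for FileSystem#rename(Path, Path).
     */
    public static List<String> renameAll(Map<String, String> moves,
                                         BiPredicate<String, String> rename,
                                         int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit all renames up front so the blocking calls overlap.
            Map<String, Future<Boolean>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, String> move : moves.entrySet()) {
                String src = move.getKey();
                String dst = move.getValue();
                Callable<Boolean> task = () -> rename.test(src, dst);
                pending.put(src, pool.submit(task));
            }
            // Wait for every rename and collect the ones that returned false.
            List<String> failed = new ArrayList<>();
            for (Map.Entry<String, Future<Boolean>> entry : pending.entrySet()) {
                if (!entry.getValue().get()) {
                    failed.add(entry.getKey());
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> moves = new LinkedHashMap<>();
        for (int i = 0; i < 8; i++) {
            moves.put("tmp/part-" + i, "out/part-" + i);
        }
        // Simulated slow rename; a real caller would delegate to FileSystem#rename.
        List<String> failed = renameAll(moves, (src, dst) -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            return true;
        }, 4);
        System.out.println("failed renames: " + failed);
    }
}
```

The pool size is a tuning knob: a larger pool overlaps more S3 calls, but the right value depends on request limits and was not discussed in the thread. Collecting failures rather than stopping at the first one is a design choice of this sketch, so the caller can decide whether to retry or fail the job.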