From: Josh Wills
Date: Mon, 23 Nov 2015 17:45:18 -0800
Subject: Re: CrunchJobHooks.CompletionHook Inefficiency on S3NativeFileSystem
To: "user@crunch.apache.org"

(I don't know the answer to this, but as I also now run Crunch on top of S3, I'm interested in a solution.)

On Mon, Nov 23, 2015 at 5:22 PM, Jeff Quinn wrote:

> Hey All,
>
> We have run into a pretty frustrating inefficiency inside
> CrunchJobHooks.CompletionHook#handleMultiPaths.
>
> This method loops over all of the partial output files and moves them to
> their ultimate destination directories, calling
> org.apache.hadoop.fs.FileSystem#rename(org.apache.hadoop.fs.Path,
> org.apache.hadoop.fs.Path) on each partial output.
>
> This is no problem when the org.apache.hadoop.fs.FileSystem in question is
> HDFS, where #rename is a cheap operation. With an implementation such as
> S3NativeFileSystem, however, it is extremely inefficient: each iteration of
> the loop makes a single blocking S3 API call, and the loop can be extremely
> long when there are many thousands of partial output files.
>
> Has anyone dealt with this before, or have any ideas for a workaround?
>
> Thanks!
>
> Jeff
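
For illustration only, the pattern described in the quoted message (one blocking rename per partial output file) and one possible way around it can be sketched as below. This is not Crunch's code and not something proposed in the thread. `Renamer` is a hypothetical stand-in for FileSystem#rename(Path, Path), paths are plain strings, and the thread count is an arbitrary assumption. The idea is simply to issue the blocking calls from a fixed thread pool so that their latencies overlap instead of adding up.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelRename {

    /** Hypothetical stand-in for FileSystem#rename(Path, Path); not a Hadoop type. */
    @FunctionalInterface
    public interface Renamer {
        boolean rename(String src, String dst) throws IOException;
    }

    /** The pattern from the message: one blocking call per file, one after another. */
    static int renameSequentially(Renamer fs, Map<String, String> moves) throws IOException {
        int moved = 0;
        for (Map.Entry<String, String> e : moves.entrySet()) {
            if (fs.rename(e.getKey(), e.getValue())) {
                moved++;
            }
        }
        return moved;
    }

    /** Sketch of a workaround: submit every rename to a fixed pool, then wait for all. */
    static int renameAll(Renamer fs, Map<String, String> moves, int threads)
            throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (Map.Entry<String, String> e : moves.entrySet()) {
                Callable<Boolean> task = () -> fs.rename(e.getKey(), e.getValue());
                results.add(pool.submit(task));
            }
            int moved = 0;
            for (Future<Boolean> f : results) {
                try {
                    if (f.get()) {
                        moved++;
                    }
                } catch (ExecutionException ex) {
                    // Surface the underlying I/O failure to the caller.
                    if (ex.getCause() instanceof IOException) {
                        throw (IOException) ex.getCause();
                    }
                    throw new IllegalStateException(ex.getCause());
                }
            }
            return moved;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Both methods return the number of renames that reported success. Whether concurrent renames are safe and actually faster depends on the FileSystem implementation and on S3 request limits, so treat this as a starting point for experiments rather than a fix.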