From: John Lilley
To: "user@hadoop.apache.org"
Subject: HTTP file server, map output, and other files
Date: Thu, 23 May 2013 21:44:12 +0000

Thanks to previous kind answers and more reading in the elephant book, I now understand that mapper tasks place partitioned results into local files that are served up to reducers via HTTP:

“The output file’s partitions are made available to the reducers over HTTP. The maximum number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property; this setting is per tasktracker, not per map task slot. The default of 40 may need to be increased for large clusters running large jobs. In MapReduce 2, this property is not applicable because the maximum number of threads used is set automatically based on the number of processors on the machine. (MapReduce 2 uses Netty, which by default allows up to twice as many threads as there are processors.)”

My question is, for a custom (non-MR) application under YARN, how would I set up my application tasks’ output data to be served over HTTP? Is there an API to control this, or are there predefined local folders that will be served up? Once I am finished with the temporary data, how do I request that the files be removed?
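
To make the question concrete, here is a minimal sketch of the do-it-yourself fallback, assuming no YARN facility is used at all: a task serves one scratch directory over HTTP and deletes it when finished. The class and method names (OutputServer, port, cleanup) are hypothetical, and the code uses only the JDK's built-in com.sun.net.httpserver package, not any Hadoop or YARN API. The worker pool is sized at twice the processor count purely to mirror the Netty default quoted above.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: serves one scratch directory over HTTP, then deletes it. */
public class OutputServer {
    private final Path root;
    private final HttpServer server;
    private final ExecutorService workers;

    /** Starts serving files under root; pass port 0 to let the OS pick a free port. */
    public OutputServer(Path root, int port) throws IOException {
        Path base = root.toAbsolutePath().normalize();
        this.root = base;
        // Assumption: pool size copies the "twice the processor count" default.
        this.workers = Executors.newFixedThreadPool(
                2 * Runtime.getRuntime().availableProcessors());
        this.server = HttpServer.create(new InetSocketAddress(port), 0);
        this.server.createContext("/", exchange -> {
            // Resolve the request path under the root; reject anything outside it.
            String rel = exchange.getRequestURI().getPath().substring(1);
            Path file = base.resolve(rel).normalize();
            if (!file.startsWith(base) || !Files.isRegularFile(file)) {
                exchange.sendResponseHeaders(404, -1); // -1 means no response body
                exchange.close();
                return;
            }
            exchange.sendResponseHeaders(200, Files.size(file));
            try (OutputStream out = exchange.getResponseBody()) {
                Files.copy(file, out);
            }
        });
        this.server.setExecutor(workers);
        this.server.start();
    }

    /** Returns the port the server is actually bound to. */
    public int port() {
        return server.getAddress().getPort();
    }

    /** Stops serving and removes the scratch directory and everything in it. */
    public void cleanup() throws IOException {
        server.stop(0);
        workers.shutdownNow();
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order lists children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

A task would construct `new OutputServer(scratchDir, 0)`, advertise `port()` to its consumers by whatever channel the application already has, and call `cleanup()` once the data is no longer needed. What I am hoping to learn is whether YARN offers something that makes this unnecessary.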

Thanks

John
