hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: HTTP file server, map output, and other files
Date Fri, 24 May 2013 06:43:17 GMT
YARN has a ShuffleHandler plugin used for MR purposes, but the APIs
used there aren't "general"/public, so you'd have to build your own
utilities to do that. It's not too difficult to achieve, but a general
API would certainly be nice.
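
As a rough illustration of what "build your own utilities" could look
like, here is a minimal sketch of a per-task file server. It is not the
ShuffleHandler and does not use any YARN or Hadoop API: it relies only on
the JDK's built-in com.sun.net.httpserver package. The class name
TaskOutputServer, its methods, and the idea of serving a single output
directory are all assumptions made for this example.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper (not part of YARN or Hadoop): serves the regular files
// under one local directory over HTTP so that other tasks can fetch them.
class TaskOutputServer {
    private final HttpServer server;
    private final ExecutorService workers;
    private final Path root;

    TaskOutputServer(Path root, int port, int threads) throws IOException {
        this.root = root.toAbsolutePath().normalize();
        // Port 0 asks the OS for any free port; read it back with port().
        this.server = HttpServer.create(new InetSocketAddress(port), 0);
        this.workers = Executors.newFixedThreadPool(threads);
        this.server.setExecutor(workers);
        this.server.createContext("/", this::handle);
    }

    void start() { server.start(); }

    int port() { return server.getAddress().getPort(); }

    void stop() {
        server.stop(0);
        workers.shutdownNow();
    }

    private void handle(HttpExchange ex) throws IOException {
        try {
            // Map the request path onto the served directory.
            String rel = ex.getRequestURI().getPath().replaceFirst("^/+", "");
            Path file = root.resolve(rel).normalize();
            if (!"GET".equals(ex.getRequestMethod())) {
                ex.sendResponseHeaders(405, -1);   // -1 means no response body
            } else if (!file.startsWith(root) || !Files.isRegularFile(file)) {
                // Rejects paths that escape the root as well as missing files.
                ex.sendResponseHeaders(404, -1);
            } else {
                ex.sendResponseHeaders(200, Files.size(file));
                try (OutputStream out = ex.getResponseBody()) {
                    Files.copy(file, out);
                }
            }
        } finally {
            ex.close();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: TaskOutputServer <directory>");
            return;
        }
        TaskOutputServer s = new TaskOutputServer(Paths.get(args[0]), 0, 4);
        s.start();
        System.out.println("Serving " + args[0] + " on port " + s.port());
    }
}
```

In a setup like this, each task would have to report its host and port to
whoever needs the data (for example through the application master), and
deleting the files once consumers are done would also be up to the
application.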

Tez (Incubating) aims to solve some of this for users writing YARN
apps in a general way, but it isn't consumable yet. You can follow Tez
on the Apache Incubator at

P.S. As mentioned, YARN-based MR2 does not use Jetty to serve map
output over HTTP anymore. It uses Netty.

On Fri, May 24, 2013 at 3:14 AM, John Lilley <john.lilley@redpoint.net> wrote:
> Thanks to previous kind answers and more reading in the elephant book, I now
> understand that mapper tasks place partitioned results into local files that
> are served up to reducers via HTTP:
> “The output file’s partitions are made available to the reducers over HTTP.
> The maximum number of worker threads used to serve the file partitions is
> controlled by the tasktracker.http.threads property; this setting is per
> tasktracker, not per map task slot. The default of 40 may need to be
> increased for large clusters running large jobs. In MapReduce 2, this
> property is not applicable because the maximum number of threads used is set
> automatically based on the number of processors on the machine. (MapReduce 2
> uses Netty, which by default allows up to twice as many threads as there are
> processors.)”
> My question is, for a custom (non-MR) application under YARN, how would I
> set up my application tasks’ output data to be served over HTTP?  Is there
> an API to control this, or are there predefined local folders that will be
> served up?  Once I am finished with the temporary data, how do I request
> that the files are removed?
> Thanks
> John

Harsh J
