hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-4888) Use Apache HttpClient for fetching map outputs
Date Sun, 11 Jan 2009 02:42:01 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-4888:
----------------------------------

    Attachment: 4888-1.patch

@Zheng: You're right, I shouldn't have said "degraded."

@Steve: Thanks for the ivy settings; I hadn't started to consider that yet. The goal of this
issue is really the same as that of HADOOP-1338. Reimplementing the connection pooling in Hadoop
could offer some advantages (e.g. more granular progress reporting), but appropriating all the
work done in HttpClient seems like a clear win until that work is completed.

I tried a similar, still preliminary patch, but with max connections per host set to 1 and
on a job with different parameters: mapred.reduce.slowstart.completed.maps=1.0, 38272
maps, 448 reducers, 32MB (generated) per map on ~300 nodes. Times are measured from the start
of the reduce (after all maps have finished, so stragglers are not a factor) to the end of
the shuffle. Each cell for runs 1-5 is avg / std.d:

|| Version || 1 || 2 || 3 || 4 || 5 || avg || avg job ||
| r732838 | 786.89 / 45.55 | 842.596 / 70.69 | 1458.75 / 83.88 | 1140.93 / 44.22 | 1294.67 / 58.87 | 1104.77 | 2479.8 |
| r732838 + patch | 803.261 / 73.36 | 783.243 / 93.34 | 792.106 / 78.94 | 917.153 / 52.91 | 776.756 / 113.56 | 814.50 | 1955.2 |

Many of the parameters need to be adjusted. In particular, the timeouts are worth revisiting,
as are the number of connections and threads at the server and client. Whether the HEAD +
GET imposes a measurable penalty may also merit consideration before this can be committed.
However, the preceding demonstrates that a measurable improvement is possible, and that this
part of the pipeline could be mined for performance improvements.
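
To make the size of that improvement concrete, here is a minimal sketch that recomputes the "avg" column from the per-run figures in the table and derives the relative reduction. The class and method names (ShuffleSpeedup, mean, reduction) are hypothetical and not part of the patch; the numbers are copied from the table as posted. With only five runs per version and a wide spread in the unpatched runs, the result is indicative rather than conclusive.

```java
import java.util.Locale;

// Hypothetical helper, not part of 4888-1.patch: recomputes the averages
// reported in the table and the relative reduction they imply.
class ShuffleSpeedup {
    // Per-run average shuffle times for runs 1-5, copied from the table.
    static final double[] BASELINE = {786.89, 842.596, 1458.75, 1140.93, 1294.67};
    static final double[] PATCHED = {803.261, 783.243, 792.106, 917.153, 776.756};

    // "avg job" column, copied from the table.
    static final double BASELINE_JOB = 2479.8;
    static final double PATCHED_JOB = 1955.2;

    // Arithmetic mean of the given samples.
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Fractional reduction of `after` relative to `before`.
    static double reduction(double before, double after) {
        return (before - after) / before;
    }

    public static void main(String[] args) {
        double base = mean(BASELINE);
        double patched = mean(PATCHED);
        System.out.printf(Locale.ROOT, "shuffle: %.2f -> %.2f (%.1f%% lower)%n",
                base, patched, 100 * reduction(base, patched));
        System.out.printf(Locale.ROOT, "job:     %.1f -> %.1f (%.1f%% lower)%n",
                BASELINE_JOB, PATCHED_JOB, 100 * reduction(BASELINE_JOB, PATCHED_JOB));
    }
}
```

On these figures the average shuffle time drops by roughly 26% and the average job time by roughly 21%.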

> Use Apache HttpClient for fetching map outputs
> ----------------------------------------------
>
>                 Key: HADOOP-4888
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4888
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Chris Douglas
>            Assignee: Chris Douglas
>         Attachments: 4888-0.patch, 4888-1.patch
>
>
> It's worth experimenting with the [HttpClient|http://hc.apache.org/httpclient-3.x/] library to speed up the shuffle.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

