hadoop-mapreduce-issues mailing list archives

From "Liyin Liang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2349) speed up list[located]status calls from input formats
Date Wed, 02 May 2012 02:02:49 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266293#comment-13266293 ]

Liyin Liang commented on MAPREDUCE-2349:

This JIRA is very meaningful for large, busy clusters.
> speed up list[located]status calls from input formats
> -----------------------------------------------------
>                 Key: MAPREDUCE-2349
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2349
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Joydeep Sen Sarma
> When a job has many input paths, listStatus calls (or the improved listLocatedStatus calls)
> invoked from the getSplits() method can take a long time. Most of the time is spent waiting
> for the previous call to complete and then dispatching the next call.
> This can be sped up greatly by dispatching multiple calls at once (via executors).
> If the same filesystem client is used, the calls are pipelined much better (since calls
> are serialized) and impose no extra burden on the namenode, while greatly reducing
> latency for the client. In a simple test during non-peak hours, this reduced the
> getSplits() time from about 3s to about 0.5s.
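
For readers skimming the thread, the idea in the description can be sketched roughly as below. This is only an illustration and not the patch attached to the issue. The class ParallelLister, the method listAll and the lister function are hypothetical names; the lister stands in for a call such as FileSystem.listStatus(Path) made through one shared client, and the pool size is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: lists many input paths concurrently instead of one by one.
final class ParallelLister {

    // 'lister' is a stand-in for the real per-path listing call (an assumption,
    // e.g. a wrapper around listStatus on a shared filesystem client).
    static Map<String, List<String>> listAll(List<String> paths,
                                             Function<String, List<String>> lister,
                                             int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Dispatch every listing call up front so none waits on the previous one.
            List<Future<List<String>>> pending = new ArrayList<>();
            for (String path : paths) {
                Callable<List<String>> task = () -> lister.apply(path);
                pending.add(pool.submit(task));
            }
            // Collect results in input order; a duplicate path keeps its last result.
            Map<String, List<String>> result = new LinkedHashMap<>();
            for (int i = 0; i < paths.size(); i++) {
                result.put(paths.get(i), pending.get(i).get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated listing: each directory "contains" two made-up files.
        Function<String, List<String>> fakeLister =
                dir -> Arrays.asList(dir + "/part-0", dir + "/part-1");
        Map<String, List<String>> listing =
                listAll(Arrays.asList("/in/a", "/in/b", "/in/c"), fakeLister, 4);
        listing.forEach((dir, files) -> System.out.println(dir + " -> " + files));
    }
}
```

The results are gathered in the order the paths were given, so the caller sees the same ordering as a sequential loop; only the waiting overlaps.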
