hadoop-mapreduce-issues mailing list archives

From "Eric Payne (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (MAPREDUCE-2011) Reduce number of getFileStatus call made from every task(TaskDistributedCache) setup
Date Mon, 11 Jan 2016 17:18:40 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne resolved MAPREDUCE-2011.
-----------------------------------
    Resolution: Won't Fix

[~knoguchi], here are [~jlowe]'s comments from an offline discussion:
I think the distributed cache already behaves the way you desire, at least in YARN. When a
resource request arrives at the nodemanager, it tries to look up the local resource info based
on that request. If it finds it (i.e., a hit in the cache), then it just increments the refcount
of the resource; I don't see any attempt to stat HDFS to verify that the file is still there.
The only time I see the timestamp of the request compared with HDFS is when it tries to download
the resource from HDFS.
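
To illustrate the behavior described above, here is a minimal Python sketch of a
refcounted local resource cache. It is a simplified model and not the nodemanager's
actual code: FakeHdfs, ResourceRequest, LocalResourceCache, and simulate are all
hypothetical names, and the sketch assumes a single node where every task asks for the
same set of resources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRequest:
    """Hypothetical cache key: the path plus the timestamp recorded by the client."""
    path: str
    timestamp: int


class FakeHdfs:
    """Stand-in for the namenode that counts getFileStatus-style lookups."""

    def __init__(self, mtimes):
        self.mtimes = dict(mtimes)
        self.stat_calls = 0

    def get_file_status(self, path):
        self.stat_calls += 1
        return self.mtimes[path]


class LocalResourceCache:
    """Simplified model of a per-node cache of localized resources."""

    def __init__(self, fs):
        self.fs = fs
        self.refcounts = {}

    def acquire(self, request):
        if request in self.refcounts:
            # Cache hit: only the refcount changes, the filesystem is not contacted.
            self.refcounts[request] += 1
            return
        # Cache miss: the resource must be downloaded, and this is the only
        # place where the requested timestamp is compared with the filesystem.
        if self.fs.get_file_status(request.path) != request.timestamp:
            raise IOError(f"{request.path} changed on the source filesystem")
        self.refcounts[request] = 1

    def release(self, request):
        self.refcounts[request] -= 1


def simulate(num_tasks, num_resources):
    """Run num_tasks tasks that each request the same num_resources resources."""
    fs = FakeHdfs({f"/cache/file{i}": 1000 + i for i in range(num_resources)})
    cache = LocalResourceCache(fs)
    requests = [ResourceRequest(f"/cache/file{i}", 1000 + i)
                for i in range(num_resources)]
    for _ in range(num_tasks):
        for request in requests:
            cache.acquire(request)
    return fs.stat_calls, cache.refcounts


if __name__ == "__main__":
    stat_calls, _ = simulate(num_tasks=500, num_resources=20)
    print(stat_calls)  # 20: one lookup per resource, at download time only
```

Under these assumptions, 500 tasks with 20 resources each cause 20 filesystem lookups on
the node instead of 10,000, because every request after the first is served from the cache.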

> Reduce number of getFileStatus call made from every task(TaskDistributedCache) setup
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2011
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2011
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: distributed-cache
>            Reporter: Koji Noguchi
>
> On our cluster, we had jobs with 20 distributed cache entries and very short-lived tasks,
> resulting in 500 map tasks launched per second and, in turn, 10,000 getFileStatus calls
> to the namenode. The namenode can handle this, but I am asking whether we can reduce it somehow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
