hadoop-common-user mailing list archives

From: Amar Kamat <ama...@yahoo-inc.com>
Subject: Re: Caching frequently map input files
Date: Mon, 11 Feb 2008 06:41:39 GMT
Shimi K wrote:
> Does Hadoop cache frequently used map input files (LRU/MRU)? Or does it
> load files from disk each time a file is needed, even if it is the same
> file that was required by the last job on the same node?
Hadoop uses data locality for scheduling tasks, and most of the hits we
get come from data locality; for the rest we simply transfer the file
over the network. If the data is huge then, as Arun mentioned, there
will be no caching. What could be done is to make the copy local and
somehow inform the namenode, but I guess the complexity involved in
doing this handshake, and its effect on re-balancing, would outweigh
the benefit. Also, this amount of replication would lead to a lot of
redundancy and hence wasted space. That might be one reason this is
not done.
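
To make the locality point concrete, here is a minimal sketch of
locality-aware assignment. All the names here (LocalityAwareAssigner,
MapTask, assign) are made up for illustration and are not Hadoop's
actual scheduler classes; it only shows the idea of preferring a task
whose input split has a replica on the requesting node:

    import java.util.List;
    import java.util.Set;

    public class LocalityAwareAssigner {

        // Minimal stand-in for a map task and the hosts holding
        // replicas of its input split.
        public static class MapTask {
            final String id;
            final Set<String> replicaHosts;

            MapTask(String id, Set<String> replicaHosts) {
                this.id = id;
                this.replicaHosts = replicaHosts;
            }
        }

        // Pick a data-local task if one exists; otherwise fall back to
        // any pending task, which must pull its input over the network.
        public static MapTask assign(String requestingHost,
                                     List<MapTask> pending) {
            for (MapTask t : pending) {
                if (t.replicaHosts.contains(requestingHost)) {
                    return t; // data-local hit: read from local disk
                }
            }
            return pending.isEmpty() ? null : pending.get(0); // network copy
        }
    }
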
If the data size is small then it could be cached, but the problem is
that keeping this kind of state information adds a lot of complexity
compared to the current design, where jobs are fairly independent and
hence easy to handle. Besides, it is cheaper to just copy these files
over the network, since they are small and the chance of a repeat hit
is also low. Again, the questions are: how much of the data would you
keep in the cache, what policy would you use to free the cache, and
how would you inform the JobTracker about cache deletions? It's good
to keep it simple.
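
For what an LRU eviction policy for such a cache could look like, here
is a minimal sketch in plain Java (LinkedHashMap with accessOrder=true).
As discussed above, Hadoop does not ship such a cache, and the
InputFileCache name is hypothetical:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Bounded LRU cache mapping an input file path to its contents.
    public class InputFileCache extends LinkedHashMap<String, byte[]> {
        private final int maxEntries;

        public InputFileCache(int maxEntries) {
            super(16, 0.75f, true); // accessOrder=true gives LRU order
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
            // Evicting locally is the easy part; the hard part, as
            // noted above, is telling the JobTracker the copy is gone.
            return size() > maxEntries;
        }
    }

Eviction itself is a one-liner; the cost is all in the bookkeeping and
the handshake with the rest of the framework, which is exactly the
complexity argued against above.
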
> I am currently using version 0.14.4
> - Shimi
