hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: About Hadoop optimizations
Date Wed, 06 May 2009 20:58:02 GMT
On Wed, May 6, 2009 at 1:46 PM, Foss User <fossist@gmail.com> wrote:

> Thanks for your response. I got a few more questions regarding
> optimizations.
> 1. Do Hadoop clients locally cache the data they last requested?

I don't know the DFS read path very well, but I don't believe there is any
built in cache here. There is an undocumented configuration variable
dfs.read.prefetch.size which affects DFSClient's prefetching of data ahead
of the current file position, but I don't want to give any answer I'm not
certain of. Hopefully someone else will chime in here.

I will answer that there is no *large* cache of data locally. HDFS is
optimized for sequential reads, where a cache is generally useless if not
harmful.
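If you did want to experiment with that prefetch variable, it would go in hadoop-site.xml like any other property. A sketch only (the value is an arbitrary example, and since the variable is undocumented its exact semantics, assumed here to be a byte count, should be verified against the DFSClient source for your version):

```xml
<!-- hadoop-site.xml: hypothetical example value, not a recommendation -->
<property>
  <name>dfs.read.prefetch.size</name>
  <!-- assumed to be a byte count read ahead of the current position -->
  <value>1048576</value>
</property>
```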

> 2. Is the metadata for file blocks on a datanode kept in the
> underlying OS's file system on the namenode, or is it kept in the RAM
> of the namenode?

The block locations are kept in the RAM of the name node, and are updated
whenever a Datanode does a "block report". This is why the namenode is in
"safe mode" at startup until it has received block locations for some
configurable percentage of blocks from the datanodes.
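That percentage is controlled by a configuration property; in the 0.x releases it is dfs.safemode.threshold.pct. A sketch, worth double-checking against your Hadoop version:

```xml
<!-- hadoop-site.xml: fraction of blocks that must be reported by
     datanodes before the namenode leaves safe mode -->
<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0.999</value> <!-- the shipped default -->
</property>
```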

> 3. If no more mapper functions can be run on the node that
> contains the data on which the mapper has to act, is Hadoop
> intelligent enough to run the new mappers on some machines within the
> same rack?

Yes, assuming you have configured a network topology script. Otherwise,
Hadoop has no magical knowledge of your network infrastructure, and it
treats the whole cluster as a single rack called /default-rack.
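A topology script is just an executable that Hadoop (via topology.script.file.name) invokes with one or more hosts or IPs as arguments, and which prints one rack path per argument on stdout. A minimal sketch, assuming an invented addressing scheme where the second octet identifies the rack:

```shell
#!/bin/sh
# Hypothetical rack mapper: assumes 10.1.x.x hosts live in /rack1
# and 10.2.x.x hosts live in /rack2. Adjust to your own network.
rack_for() {
  case "$1" in
    10.1.*) echo "/rack1" ;;
    10.2.*) echo "/rack2" ;;
    *)      echo "/default-rack" ;;   # unknown hosts fall back
  esac
}

# Hadoop passes one or more hosts as arguments and expects
# one rack path per line on stdout, in the same order.
for host in "$@"; do
  rack_for "$host"
done
```

Point topology.script.file.name at the script's path in your configuration and make sure it is executable by the user running the namenode and jobtracker.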

> 4. When can a case like the above happen? I mean, when can it happen
> that the maximum number of mappers configured for a tasktracker has
> been reached but Hadoop still needs to start more mappers?

If you have a file with 100 blocks all on the same three nodes, but you have
a six-node cluster, it will schedule some tasks on nodes that do not contain
the blocks, since it would rather keep the cluster utilized than keep all
data access local.

> 5. Are the multiple mappers and reducers run as separate threads
> within the same TaskTracker process?

No, they are run as child processes.
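Because each task runs in its own child JVM, per-task memory is set through the child JVM options rather than on the TaskTracker process itself. A sketch (the heap size here is an arbitrary example, not a recommendation):

```xml
<!-- hadoop-site.xml: JVM options passed to each spawned task process -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value> <!-- example heap size; tune per workload -->
</property>
```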

