hadoop-common-dev mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-3022) Fast Cluster Restart
Date Fri, 09 May 2008 00:02:55 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594019#action_12594019 ]

shv edited comment on HADOOP-3022 at 5/8/08 5:01 PM:
---------------------------------------------------------------------

h2. Estimates.
The name-node startup consists of 4 major steps:
# load fsimage
# load edits
# save new merged fsimage
# process block report

I am estimating the name-node startup time based on an image of 10 million objects.
The estimates are based on my experiments with a real cluster,
with the numbers scaled proportionally (by a factor of 4.3) to 10 mln objects.

Under the assumption that each file on average has 1.5 blocks, 10 mln objects
translate into 4 mln files and directories, and 6 mln blocks.
Other cluster parameters are summarized in the following table.

|objects|10 mln|
|files & dirs| 4 mln|
|blocks| 6 mln|
|heap size| 3.275 GB|
|image size| 0.6 GB|
|image load time| 132 sec|
|image save time| 70 sec|
|edits size per day| 0.27 GB|
|edits load time| 84 sec|
|data-nodes| 500|
|blocks per node| 36,000|
|block report processing| 320 sec|
|total startup time| 10 min|

This leads to a total startup time of 10 minutes, out of which
|load fsimage| 22%|
|load edits| 14%|
|save new fsimage| 11%|
|process block reports| 53%|
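The arithmetic behind the two tables above can be double-checked with a short sketch; the constants below are the measured values from the tables, nothing else is assumed:

```java
// Sanity check of the startup estimates, using the measured per-step times.
public class StartupEstimate {
    public static void main(String[] args) {
        // Cluster parameters from the table above.
        int filesAndDirs = 4_000_000;
        int blocks = 6_000_000;
        System.out.println("objects = " + (filesAndDirs + blocks));                // 10 mln
        System.out.println("blocks per file = " + (double) blocks / filesAndDirs); // 1.5

        // Measured per-step times in seconds.
        int loadImage = 132, loadEdits = 84, saveImage = 70, blockReports = 320;
        int total = loadImage + loadEdits + saveImage + blockReports;
        System.out.println("total = " + total + " sec ~ "
                + Math.round(total / 60.0) + " min");                              // 606 sec ~ 10 min
        System.out.println("block report share = "
                + Math.round(100.0 * blockReports / total) + "%");                 // 53%
    }
}
```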

h2. Optimization.
# VM memory heap parameters play a key role in the startup process, as discussed in HADOOP-3248.
It is highly recommended to set the initial heap size (-Xms) close to the maximum heap size,
because all that memory will be used by the name-node anyway.
# Optimization of loading and saving is substantially a matter of reducing object allocation.
In the case of loading, object allocations are unavoidable, so we need to make sure
there are no intermediate allocations of temporary objects.
In the case of saving, all object allocations should be eliminated, which is done in HADOOP-3248.
# During loading, for each file or directory the name-node performs a lookup in the
namespace tree starting from the root in order to find the object's parent directory and then
insert the object into it.
We can take advantage here of the fact that a directory's children are not interleaved with
children of other directories in the image.
So we can find a parent once and then include all of its children without repeating
the search.
# Block reporting is the most expensive single part of the startup process.
According to my experiments, most of the processing time here goes into first adding all blocks
to the needed-replication queue and then removing them from it.
During startup all blocks are guaranteed to be under-replicated in the beginning,
but most of them, and in the regular case all of them, will later be removed from that list.
So in a sense we dynamically create a huge temporary structure just to make sure that all
blocks have enough replicas.
In addition, in pre-HADOOP-2606 versions (before release 0.17) the replication monitor
would start processing those under-replicated blocks and try to assign nodes for copying them.
The structure works fine during regular operation, because then it contains only those blocks
that are in fact under-replicated.
The processing of block reports goes 5 times faster if the addition to the needed-replication
queue is removed.
# We should also reduce the number of block lookups in the blocksMap. I counted 5 lookups
just in addStoredBlock(), while only one lookup is necessary, because the namespace is
locked and there is only one thread that modifies it.
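The parent-caching idea from item 3 can be sketched as follows. This is a hypothetical toy model, not the actual FSDirectory code: class and method names are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of item 3: because a directory's children are stored
// contiguously in the image, we can resolve a parent path once and reuse it
// for all of its children, instead of walking the tree from the root each time.
class ImageLoaderSketch {
    static class INodeDir {
        final Map<String, INodeDir> children = new HashMap<>();
    }

    final INodeDir root = new INodeDir();
    String cachedParentPath = null; // last parent path we resolved
    INodeDir cachedParent = null;   // its node in the namespace tree
    int treeWalks = 0;              // counts full root-to-parent lookups

    // Full lookup from the root (the expensive path), creating missing dirs.
    INodeDir resolve(String path) {
        treeWalks++;
        INodeDir cur = root;
        for (String comp : path.split("/")) {
            if (comp.isEmpty()) continue;
            cur = cur.children.computeIfAbsent(comp, k -> new INodeDir());
        }
        return cur;
    }

    // Add one entry from the image, reusing the cached parent when possible.
    void addFromImage(String parentPath, String name) {
        if (!parentPath.equals(cachedParentPath)) {
            cachedParent = resolve(parentPath);
            cachedParentPath = parentPath;
        }
        cachedParent.children.put(name, new INodeDir());
    }
}
```

Loading N children of the same directory then costs one tree walk instead of N, which is exactly the saving the contiguous image layout makes possible.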

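The single-lookup refactoring from item 5 can likewise be illustrated with a toy model. The names below are hypothetical, not the real FSNamesystem code; the point is only the before/after lookup count under the namespace lock.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of item 5: while the namespace lock is held, a block's
// map entry cannot change between lookups, so fetch it once and reuse the
// reference instead of re-querying the blocksMap at every step.
class BlocksMapSketch {
    static class BlockInfo { int replicas; }

    final Map<Long, BlockInfo> blocksMap = new HashMap<>();
    int lookups = 0; // counts blocksMap queries

    BlockInfo get(long blockId) {
        lookups++;
        return blocksMap.computeIfAbsent(blockId, k -> new BlockInfo());
    }

    // Before: each step re-queries the map (several lookups per block).
    void addStoredBlockNaive(long blockId) {
        get(blockId).replicas++;                      // lookup 1: record replica
        if (get(blockId).replicas >= 3) { /* ... */ } // lookup 2: replication check
        if (get(blockId).replicas > 3)  { /* ... */ } // lookup 3: excess check
    }

    // After: a single lookup, then operate on the cached reference.
    void addStoredBlockSingleLookup(long blockId) {
        BlockInfo info = get(blockId);                // the only map lookup
        info.replicas++;
        if (info.replicas >= 3) { /* replication check */ }
        if (info.replicas > 3)  { /* excess check */ }
    }
}
```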
> Fast Cluster Restart
> --------------------
>
>                 Key: HADOOP-3022
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3022
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Robert Chansler
>            Assignee: Konstantin Shvachko
>             Fix For: 0.18.0
>
>
> This item introduces a discussion of how to reduce the time necessary to start a large
cluster from tens of minutes to a handful of minutes.

-- 
This message is automatically generated by JIRA.