hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Haohui Mai (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5698) Use protobuf to serialize / deserialize FSImage
Date Thu, 30 Jan 2014 00:06:17 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886041#comment-13886041

Haohui Mai commented on HDFS-5698:

I prototyped a parallelized version of the FSImageLoader. Here is the number running on the
same environment:

|Size in Old|512M|1G|2G|4G|8G|
|Loading in Old (ms)|12819|24664|48240|114090|307689|
|Loading in PB(Parallel) (ms)|17927|32997|64581|138306|373391|

The main changes of the prototype are:
# Compute the MD5 checksum in a separate thread.
# Parallelize the construction of INodes (with 6 threads)
# Use a dedicated thread to update the blocks map.

Currently I haven't parallelize the construction of INodeDirectory yet, but it seems to me
that the performance numbers are reasonable (recall that coming out the safe mode automatically
will take 30 seconds). I think we can fine tune the performance after the work is merged into

> Use protobuf to serialize / deserialize FSImage
> -----------------------------------------------
>                 Key: HDFS-5698
>                 URL: https://issues.apache.org/jira/browse/HDFS-5698
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haohui Mai
>            Assignee: Haohui Mai
>         Attachments: HDFS-5698.000.patch, HDFS-5698.001.patch
> Currently, the code serializes FSImage using in-house serialization mechanisms. There
are a couple disadvantages of the current approach:
> # Mixing the responsibility of reconstruction and serialization / deserialization. The
current code paths of serialization / deserialization have spent a lot of effort on maintaining
compatibility. What is worse is that they are mixed with the complex logic of reconstructing
the namespace, making the code difficult to follow.
> # Poor documentation of the current FSImage format. The format of the FSImage is practically
defined by the implementation. An bug in implementation means a bug in the specification.
Furthermore, it also makes writing third-party tools quite difficult.
> # Changing schemas is non-trivial. Adding a field in FSImage requires bumping the layout
version every time. Bumping out layout version requires (1) the users to explicitly upgrade
the clusters, and (2) putting new code to maintain backward compatibility.
> This jira proposes to use protobuf to serialize the FSImage. Protobuf has been used to
serialize / deserialize the RPC message in Hadoop.
> Protobuf addresses all the above problems. It clearly separates the responsibility of
serialization and reconstructing the namespace. The protobuf files document the current format
of the FSImage. The developers now can add optional fields with ease, since the old code can
always read the new FSImage.

This message was sent by Atlassian JIRA

View raw message