hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7435) PB encoding of block reports is very inefficient
Date Tue, 03 Mar 2015 22:11:07 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345859#comment-14345859

Jing Zhao commented on HDFS-7435:

Thanks for working on this, Daryn. The patch looks good to me in general. Some comments below:
# Please add javadoc for BlockListAsLongs.
# I think we'd better not include the replica state for finalized replicas. Though it is encoded
using a varint, considering most of the blocks are finalized in the report, this can still
increase the total size of the report. How about not including the finalized state and write
an extra number in the header to indicate the total number of finalized replicas?
# For {{BlockReportLongIterator}}, it can be more clean to only keep either {{iter}} or {{currentBlockIndex}}.
# Now the iterator becomes an internal field of a BlockListAsLongs. I guess the main motivation
here is to support both new and old encoding but avoid having both a ByteString and a long[]
inside of BlockListsAsLongs? But in the meanwhile, this means there can be only one user of
{{iterator()}} in the whole life time of BlockListAsLongs? This usage pattern is true in the
current code, but is it good for BlockListAsLongs as a pure data structure?
# Also in {{getBlockReportIterator}}, if {{iter}} is null, we're actually creating a new {{BlockListAsLongs}}
instance and returning its iterator. Though this may not be hit by normal code path, this
may not be very clean.
# To address the above two comments, maybe what we can do is:
1) declare both the ByteString and the long[] in BlockListAsLongs and build iterator on top
2) or keep the old BlockListAsLongs class (but annotate it as deprecated) and use it for handling
the old style encoding. In this way we can have a more clean new BlockListAsLongs, and we
can delete the old one in the future.

> PB encoding of block reports is very inefficient
> ------------------------------------------------
>                 Key: HDFS-7435
>                 URL: https://issues.apache.org/jira/browse/HDFS-7435
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-7435.000.patch, HDFS-7435.001.patch, HDFS-7435.002.patch, HDFS-7435.patch,
HDFS-7435.patch, HDFS-7435.patch, HDFS-7435.patch, HDFS-7435.patch
> Block reports are encoded as a PB repeating long.  Repeating fields use an {{ArrayList}}
with default capacity of 10.  A block report containing tens or hundreds of thousand of longs
(3 for each replica) is extremely expensive since the {{ArrayList}} must realloc many times.
 Also, decoding repeating fields will box the primitive longs which must then be unboxed.

This message was sent by Atlassian JIRA

View raw message