hbase-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-11729) Document HFile v3
Date Thu, 21 Aug 2014 13:06:12 GMT

    [ https://issues.apache.org/jira/browse/HBASE-11729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105347#comment-14105347 ]

Hadoop QA commented on HBASE-11729:

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  against trunk revision .
  ATTACHMENT ID: 12663339

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+0 tests included{color}.  The patch appears to be a documentation patch
that doesn't require tests.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of
javac compiler warnings.

    {color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 7 warning messages.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version
2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number
of release audit warnings.

    {color:red}-1 lineLengths{color}.  The patch introduces the following lines longer than
    +    <para>As we will be discussing changes to the HFile format, it is useful to
give a short overview of the original (HFile version 1) format.</para>
+           <footnote><para>Image courtesy of Lars George, <link xlink:href="http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html">hbase-architecture-101-storage.html</link>.</para></footnote>
+   <para>The number of entries in the block index is stored in the fixed file trailer,
and has to be passed in to the method that reads the block index. One of the limitations of
the block index in version 1 is that it does not provide the compressed size of a block, which
turns out to be necessary for decompression. Therefore, the HFile reader has to infer this
compressed size from the offset difference between blocks. We fix this limitation in version
2, where we store on-disk block size instead of uncompressed size, and get uncompressed size
from the block header.</para>
+   <para>We found it necessary to revise the HFile format after encountering high memory
usage and slow startup times caused by large Bloom filters and block indexes in the region
server. Bloom filters can get as large as 100 MB per HFile, which adds up to 2 GB when aggregated
over 20 regions. Block indexes can grow as large as 6 GB in aggregate size over the same set
of regions. A region is not considered opened until all of its block index data is loaded.
Large Bloom filters produce a different performance problem: the first get request that requires
a Bloom filter lookup will incur the latency of loading the entire Bloom filter bit array.</para>
+   <para>To speed up region server startup we break Bloom filters and block indexes
into multiple blocks and write those blocks out as they fill up, which also reduces the HFile
writer’s memory footprint. In the Bloom filter case, “filling up a block”
means accumulating enough keys to efficiently utilize a fixed-size bit array, and in the block
index case we accumulate an “index block” of the desired size. Bloom filter
blocks and index blocks (we call these “inline blocks”) become interspersed
with data blocks, and as a side effect we can no longer rely on the difference between block
offsets to determine data block length, as it was done in version 1.</para>
+   <para>HFile is a low-level file format by design, and it should not deal with application-specific
details such as Bloom filters, which are handled at StoreFile level. Therefore, we call Bloom
filter blocks in an HFile "inline" blocks. We also supply HFile with an interface to write
those inline blocks. </para>
+   <para>Another format modification aimed at reducing the region server startup time
is to use a contiguous “load-on-open” section that has to be loaded in memory
at the time an HFile is being opened. Currently, as an HFile opens, there are separate seek
operations to read the trailer, data/meta indexes, and file info. To read the Bloom filter,
there are two more seek operations for its “data” and “meta”
portions. In version 2, we seek once to read the trailer and seek again to read everything
else we need to open the file from a contiguous block.</para></section>
+   <para>The version of HBase introducing the above features reads both version 1 and
2 HFiles, but only writes version 2 HFiles. A version 2 HFile is structured as follows:
+         <para>8 bytes: Block type, a sequence of bytes equivalent to version 1's "magic
records". Supported block types are: </para>
+                     INTERMEDIATE_INDEX – intermediate-level index blocks in a multi-level

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

     {color:red}-1 core tests{color}.  The patch failed these unit tests:

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/10516//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10516//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10516//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10516//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10516//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10516//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10516//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10516//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10516//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/10516//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/10516//console

This message is automatically generated.

> Document HFile v3
> -----------------
>                 Key: HBASE-11729
>                 URL: https://issues.apache.org/jira/browse/HBASE-11729
>             Project: HBase
>          Issue Type: Task
>          Components: documentation, HFile
>    Affects Versions: 0.98.0
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>            Priority: Trivial
>              Labels: beginner
>         Attachments: HBASE-11729-v2.patch, HBASE-11729-v2.pdf, HBASE-11729.patch, HBASE-11729.pdf
> 0.98 added HFile v3. There are a couple of mentions of it in the book on the sections
on cell tags, but there isn't an actual overview or design explanation like there is for [HFile

This message was sent by Atlassian JIRA
