hadoop-common-issues mailing list archives

From "Thomas Marquardt (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14520) WASB: Block compaction for Azure Block Blobs
Date Sat, 19 Aug 2017 03:06:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133912#comment-16133912 ]

Thomas Marquardt commented on HADOOP-14520:

I will hand this off to Georgi, as he is returning from vacation Monday.  I noticed the following
while reviewing the latest patches:

1) {{writeBlockRequestInternal}} has retry logic that returns the buffer to the pool and then
retries using the same buffer it just returned, so a retry can write from a buffer that the
pool has already handed to another caller.

2) {{writeBlockRequestInternal}} currently returns a byte array that was originally created by
{{ByteArrayOutputStream}} to the buffer pool.  To see this, look at {{blockCompaction}}: it
creates a {{ByteArrayOutputStreamInternal}}, wraps the underlying {{byte[]}} in a {{ByteBuffer}},
and passes it to {{writeBlockRequestInternal}}, which returns it to the pool.

3) {{blockCompaction}} can be refactored to make unit testing easy.  For example, extracting
a {{getBlockSequenceForCompaction}} function that takes a block list as input and returns
the sequence of blocks to be compacted would allow a data-driven unit test to run many different
block lists through the algorithm.

4) I recommend the following description for the blockCompaction function:

 * Block compaction is only enabled when the number of blocks exceeds
 * activateCompactionBlockCount.  The algorithm searches for the longest
 * sequence of two or more blocks {b1, b2, ..., bn} such that
 * size(b1) + size(b2) + ... + size(bn) < maximum-block-size.  It then
 * downloads the blocks in the sequence, concatenates the data to form a
 * single block, uploads this new block, and updates the block list to
 * replace the sequence of blocks with the new block.

5) I recommend renaming {{BlockBlobAppendStream.bufferSize}} to {{maxBlockSize}}.  It is the
maximum size of a block.
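
Items 1 and 2 both come down to who owns a buffer and when it goes back to the pool. The following is only a sketch of one way to structure that, not the code in the patch: {{SimplePool}}, {{Uploader}}, {{writeWithRetry}} and the {{fromPool}} flag are hypothetical names invented for illustration. The buffer is handed back at most once, after the last attempt, and only if it came from the pool in the first place.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

class BufferReturnSketch {
    /** Hypothetical stand-in for the real buffer pool. */
    static final class SimplePool {
        private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
        private final int capacity;

        SimplePool(int capacity) { this.capacity = capacity; }

        synchronized ByteBuffer take() {
            ByteBuffer b = free.poll();
            return b != null ? b : ByteBuffer.allocate(capacity);
        }

        synchronized void give(ByteBuffer b) {
            b.clear();
            free.push(b);
        }

        synchronized int size() { return free.size(); }
    }

    /** Hypothetical upload callback; throws IOException on a transient failure. */
    interface Uploader {
        void upload(ByteBuffer data) throws IOException;
    }

    static void writeWithRetry(SimplePool pool, ByteBuffer buffer, boolean fromPool,
                               Uploader uploader, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        try {
            IOException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    // duplicate() gives each attempt its own position and limit
                    // over the same bytes, so a retry starts from the beginning.
                    uploader.upload(buffer.duplicate());
                    return;
                } catch (IOException e) {
                    last = e; // keep the buffer: the next attempt still needs it
                }
            }
            throw last;
        } finally {
            // Return the buffer exactly once, after all attempts, and never
            // return an array the pool did not hand out (for example one
            // wrapped from a ByteArrayOutputStream).
            if (fromPool) {
                pool.give(buffer);
            }
        }
    }
}
```

A caller that wraps a {{byte[]}} from a {{ByteArrayOutputStream}} would pass {{fromPool = false}}; a caller that obtained the buffer from the pool would pass {{true}}.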
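
The search described in item 4, extracted as the kind of pure function item 3 suggests, might look roughly like this. It is a minimal sketch under stated assumptions: the "sequence" is taken to be a contiguous run in the block list, block sizes are non-negative, ties go to the earliest run, and the signature (an array of sizes in, a {start, length} pair out) is invented here rather than taken from the patch.

```java
class CompactionSketch {
    /**
     * Returns {startIndex, length} of the longest contiguous run of two or
     * more blocks whose combined size is strictly less than maxBlockSize,
     * or null when no such run exists. Assumes non-negative block sizes.
     */
    static int[] getBlockSequenceForCompaction(long[] blockSizes, long maxBlockSize) {
        int bestStart = -1;
        int bestLength = 0;
        long windowSum = 0;
        int start = 0;
        for (int end = 0; end < blockSizes.length; end++) {
            windowSum += blockSizes[end];
            // Shrink from the left until the window fits under the limit.
            while (start <= end && windowSum >= maxBlockSize) {
                windowSum -= blockSizes[start];
                start++;
            }
            int length = end - start + 1;
            if (length >= 2 && length > bestLength) {
                bestStart = start;
                bestLength = length;
            }
        }
        return bestLength >= 2 ? new int[] {bestStart, bestLength} : null;
    }

    public static void main(String[] args) {
        // Data-driven check: each row is one block list run through the search.
        long[][] blockLists = {
            {1, 1, 1, 5, 1, 1},
            {3, 3, 3},
            {2, 1, 1, 1},
        };
        for (long[] sizes : blockLists) {
            int[] result = getBlockSequenceForCompaction(sizes, 4);
            System.out.println(result == null
                ? "no sequence"
                : "start=" + result[0] + " length=" + result[1]);
        }
    }
}
```

Because the function only looks at sizes, a unit test can feed it many block lists without touching storage; the download, concatenate, upload and block-list update steps stay in {{blockCompaction}}.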

> WASB: Block compaction for Azure Block Blobs
> --------------------------------------------
>                 Key: HADOOP-14520
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14520
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/azure
>    Affects Versions: 3.0.0-alpha3
>            Reporter: Georgi Chalakov
>            Assignee: Georgi Chalakov
>         Attachments: HADOOP-14520-006.patch, HADOOP-14520-05.patch
> Block Compaction for WASB allows uploading new blocks for every hflush/hsync call. When
> the number of blocks is above 32000, the next hflush/hsync triggers the block compaction process.
> Block compaction replaces a sequence of blocks with one block. From all the sequences with
> total length less than 4M, compaction chooses the longest one. It is a greedy algorithm that
> preserves all potential candidates for the next round. Block Compaction for WASB increases
> data durability and allows using block blobs instead of page blobs. By default, block compaction
> is disabled. Similar to the configuration for page blobs, the client needs to specify HDFS
> folders where block compaction over block blobs is enabled.
> Results for HADOOP-14520-05.patch
> tested endpoint: fs.azure.account.key.hdfs4.blob.core.windows.net
> Tests run: 707, Failures: 0, Errors: 0, Skipped: 119

