hadoop-common-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-11188) hadoop-azure: automatically expand page blobs when they become full
Date Wed, 17 Dec 2014 21:37:16 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Chris Nauroth updated HADOOP-11188:
    Target Version/s: 2.7.0  (was: 3.0.0)

> hadoop-azure: automatically expand page blobs when they become full
> -------------------------------------------------------------------
>                 Key: HADOOP-11188
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11188
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Eric Hanson
>            Assignee: Eric Hanson
>         Attachments: hadoop-11188.01.patch
> Right now, page blobs are initialized to a fixed size (fs.azure.page.blob.size) and cannot
be expanded. This task is to make them automatically expand when they get to be nearly full.
> Design: if a write occurs that does not have enough room in the file to finish, then
flush all preceding operations, extend the file, and complete the write. This sequence will
be synchronized on PageBlobOutputStream (to have exclusive access), so there won't be race conditions.
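
The flush, extend, then write sequence described above could look roughly like the following. This is a minimal in-memory sketch, not the code in hadoop-11188.01.patch: the class ExpandingPageBlobSketch and all of its members are hypothetical, no Azure calls are made, and the 1TB cap is left out. It only illustrates doing the three steps under one lock.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of an auto-expanding page blob stream.
// It tracks sizes only; a real stream would talk to Azure Storage.
final class ExpandingPageBlobSketch {
    private long capacity;             // bytes currently allocated to the blob
    private long used;                 // bytes written so far
    private final long extensionSize;  // bytes added per extension
    private final List<String> log = new ArrayList<>(); // order of operations

    ExpandingPageBlobSketch(long initialCapacity, long extensionSize) {
        this.capacity = initialCapacity;
        this.extensionSize = extensionSize;
    }

    // The whole check/flush/extend/write sequence holds the object's lock,
    // so a concurrent writer cannot interleave with an extension.
    synchronized void write(int length) {
        if (length > extensionSize) {
            // Assumption: a single write never exceeds one extension, which
            // mirrors "minimum extension size == maximum write size".
            throw new IllegalArgumentException("write larger than extension size");
        }
        if (used + length > capacity) {
            flush();                   // finish all preceding operations first
            capacity += extensionSize; // then grow the blob
            log.add("extend");
        }
        used += length;                // finally complete the write
        log.add("write");
    }

    // Java monitors are reentrant, so calling this from write() is safe.
    synchronized void flush() {
        log.add("flush");
    }

    synchronized long capacity() {
        return capacity;
    }

    synchronized List<String> log() {
        return new ArrayList<>(log);
    }
}
```

Because write() and flush() share one monitor, a second thread calling write() during an extension simply blocks until the extension and the pending write have both finished.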
> The file will be extended by fs.azure.page.blob.extension.size bytes, which must be a
multiple of 512. The internal default for fs.azure.page.blob.extension.size will be 128 *
1024 * 1024. The minimum extension size will be 4 * 1024 * 1024, which is the maximum write
size, so the new write will finish.
> Extension will stop when the file size reaches 1TB. The final extension may be less than
fs.azure.page.blob.extension.size if the remainder (1TB - current_file_size) is smaller than
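
The sizing rules in the two paragraphs above (multiple of 512, 4 MB minimum, 128 MB default, 1TB ceiling) can be sketched as plain arithmetic. The class and method names below are hypothetical and are not taken from the patch. Raising a too-small setting to the minimum, rather than rejecting it, is an assumption made for illustration.

```java
// Hypothetical helper illustrating the extension-size arithmetic.
final class PageBlobExtensionSize {
    static final long PAGE_SIZE = 512L;                            // required granularity
    static final long MIN_EXTENSION = 4L * 1024 * 1024;            // equals the maximum write size
    static final long DEFAULT_EXTENSION = 128L * 1024 * 1024;      // internal default
    static final long MAX_BLOB_SIZE = 1024L * 1024 * 1024 * 1024;  // 1TB ceiling

    // Validate a configured fs.azure.page.blob.extension.size value.
    static long validateExtensionSize(long configured) {
        if (configured <= 0 || configured % PAGE_SIZE != 0) {
            throw new IllegalArgumentException(
                "extension size must be a positive multiple of " + PAGE_SIZE);
        }
        // Assumption: values below the minimum are raised to the minimum.
        return Math.max(configured, MIN_EXTENSION);
    }

    // Bytes to add on the next extension; 0 once the blob has reached 1TB.
    static long nextExtension(long currentSize, long extensionSize) {
        long remaining = MAX_BLOB_SIZE - currentSize;
        if (remaining <= 0) {
            return 0L;
        }
        // The final extension may be shorter than the configured size.
        return Math.min(extensionSize, remaining);
    }
}
```

For example, a blob that is 1 MB short of 1TB would be extended by 1 MB rather than by the full 128 MB default, and a blob already at 1TB would not be extended at all.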
> An alternative to this is to make the default size 1TB. This is much simpler to implement.
It's a one-line change. Or even simpler, don't change it at all because it is adequate for
> Rationale for this file size extension feature:
> 1) be able to download files to local disk easily with CloudXplorer and similar tools.
Downloading a 1TB page blob is not practical if you don't have 1TB disk space since on the
local side it expands to the full file size, locally filled with zeros where there is no valid
> 2) don't make customers uncomfortable when they see large 1TB files. They often ask if
they have to pay for it, even though they only pay for the space actually used in the page
> I think rationale 2 is a relatively minor issue, because 98% of customers for HBase will
never notice. They will just use it and not look at what kind of files are used for the logs.
They don't pay for the unused space, so it is not a problem for them. We can document this.
Also, if they use hadoop fs -ls, they will see the actual size of the files since I put in
a fix for that.
> Rationale 1 is a minor issue because you cannot interpret the data on your local file
system anyway due to the data format. So really, the only reason to copy data locally in its
binary format would be if you are moving it around or archiving it. Copying a 1TB page blob
from one location in the cloud to another is pretty fast with smart copy utilities that don't
actually move the 0-filled parts of the file.
> Nevertheless, this is a convenience feature for users. They won't have to worry about
setting fs.azure.page.blob.size under normal circumstances and can make the files grow as
big as they want.
> If we make the change to extend the file size on the fly, that introduces new possible
error or failure modes for HBase. We should include retry logic.

This message was sent by Atlassian JIRA
