hadoop-common-issues mailing list archives

From "Eric Hanson (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-11188) hadoop-azure: automatically expand page blobs when they become full
Date Fri, 10 Oct 2014 18:04:37 GMT
Eric Hanson created HADOOP-11188:

             Summary: hadoop-azure: automatically expand page blobs when they become full
                 Key: HADOOP-11188
                 URL: https://issues.apache.org/jira/browse/HADOOP-11188
             Project: Hadoop Common
          Issue Type: Improvement
          Components: fs
            Reporter: Eric Hanson

Right now, page blobs are initialized to a fixed size (fs.azure.page.blob.size) and cannot
be expanded. This task is to make them automatically expand when they get to be nearly full.

Design: if a write does not have enough room left in the file to finish, flush all preceding
operations, extend the file, and then complete the write. This sequence will be synchronized
in PageBlobOutputStream (giving it exclusive access), so there won't be race conditions.
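
A minimal sketch of that flush / extend / write sequence, for illustration only. It is an
in-memory stand-in, not the real PageBlobOutputStream: the class name ExpandingStreamSketch,
its fields, and the event log are all hypothetical, and it assumes no single write is larger
than the extension size, so one extension always makes enough room.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model; it only tracks sizes and the order of operations.
final class ExpandingStreamSketch {
    private final long extensionSize;
    private long capacity;   // current size of the (simulated) page blob
    private long position;   // next byte offset to write
    private final List<String> log = new ArrayList<>();

    ExpandingStreamSketch(long initialCapacity, long extensionSize) {
        this.capacity = initialCapacity;
        this.extensionSize = extensionSize;
    }

    // synchronized: the room check, flush, extension and write happen as one unit,
    // so two writers cannot both decide to extend at the same time.
    synchronized void write(int length) {
        if (length < 0 || length > extensionSize) {
            throw new IllegalArgumentException("write size must be in [0, extensionSize]");
        }
        if (position + length > capacity) {
            log.add("flush");            // flush all preceding operations first
            capacity += extensionSize;   // then extend the file
            log.add("extend");
        }
        position += length;              // finally complete the write
        log.add("write");
    }

    synchronized long capacity() { return capacity; }

    synchronized List<String> log() { return new ArrayList<>(log); }
}
```

Writing 1000 bytes and then 100 bytes into a 1024-byte stream with a 4096-byte extension
leaves the capacity at 5120 and records write, flush, extend, write, in that order.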

The file will be extended by fs.azure.page.blob.extension.size bytes, which must be a multiple
of 512. The internal default for fs.azure.page.blob.extension.size will be 128 * 1024 * 1024.
The minimum extension size will be 4 * 1024 * 1024, which is the maximum write size, so the
write that triggered the extension will always fit after a single extension.
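
The constraints on the extension size can be sketched as a small validation helper. The class
and method names are hypothetical, and rejecting an invalid value with an exception is an
assumption; the proposal does not say whether bad values are rejected or adjusted.

```java
// Hypothetical helper; only the three constants come from the proposal.
final class ExtensionSizeConfig {
    static final long PAGE_SIZE = 512L;                        // required multiple
    static final long MIN_EXTENSION = 4L * 1024 * 1024;        // equals the maximum write size
    static final long DEFAULT_EXTENSION = 128L * 1024 * 1024;  // internal default

    // Returns the extension size to use; null means "not configured".
    // Assumption: invalid values are rejected rather than rounded or clamped.
    static long resolve(Long configured) {
        long size = (configured == null) ? DEFAULT_EXTENSION : configured;
        if (size < MIN_EXTENSION) {
            throw new IllegalArgumentException("extension size below minimum: " + size);
        }
        if (size % PAGE_SIZE != 0) {
            throw new IllegalArgumentException("extension size not a multiple of 512: " + size);
        }
        return size;
    }
}
```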

Extension will stop when the file size reaches 1TB. The final extension may be less than fs.azure.page.blob.extension.size
if the remainder (1TB - current_file_size) is smaller than fs.azure.page.blob.extension.size.
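
The capped arithmetic above can be written out as follows. This is a sketch under the
assumption that 1TB means 2^40 bytes; the class and method names are hypothetical.

```java
// Hypothetical helper showing how the final extension shrinks at the 1TB cap.
final class PageBlobExtension {
    static final long MAX_SIZE = 1024L * 1024 * 1024 * 1024;  // 1TB, assumed to be 2^40 bytes

    // Returns the new file size after one extension step.
    static long nextSize(long currentSize, long extensionSize) {
        long remainder = MAX_SIZE - currentSize;
        if (remainder <= 0) {
            return currentSize;  // already at the cap: extension stops
        }
        // The final extension may be smaller than extensionSize.
        return currentSize + Math.min(extensionSize, remainder);
    }
}
```

With a 128 * 1024 * 1024 extension size, a file that is 1MB short of 1TB grows by only that
1MB, and a file already at 1TB does not grow at all.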

An alternative to this is to make the default size 1TB. This is much simpler to implement.
It's a one-line change. Or even simpler, don't change it at all because it is adequate for

Rationale for this file size extension feature:

1) be able to download files to local disk easily with CloudXplorer and similar tools. Downloading
a 1TB page blob is not practical if you don't have 1TB disk space since on the local side
it expands to the full file size, locally filled with zeros where there is no valid data.

2) don't make customers uncomfortable when they see large 1TB files. They often ask if they
have to pay for it, even though they only pay for the space actually used in the page blob.

I think rationale 2 is a relatively minor issue, because 98% of customers for HBase will never
notice. They will just use it and not look at what kind of files are used for the logs. They
don't pay for the unused space, so it is not a problem for them. We can document this. Also,
if they use hadoop fs -ls, they will see the actual size of the files since I put in a fix
for that.

Rationale 1 is a minor issue because you cannot interpret the data on your local file system
anyway due to the data format. So really, the only reason to copy data locally in its binary
format would be if you are moving it around or archiving it. Copying a 1TB page blob from
one location in the cloud to another is pretty fast with smart copy utilities that don't actually
move the 0-filled parts of the file.

Nevertheless, this is a convenience feature for users. They won't have to worry about setting
fs.azure.page.blob.size under normal circumstances and can make the files grow as big as they

If we make the change to extend the file size on the fly, that introduces new possible error
or failure modes for HBase. We should include retry logic.
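
One possible shape for such retry logic is a bounded retry with a doubling delay. This is only
a generic sketch: the proposal does not specify a retry policy, and the helper name, attempt
limit, and backoff scheme are all assumptions.

```java
import java.util.concurrent.Callable;

// Hypothetical helper; not part of hadoop-azure.
final class RetrySketch {
    // Runs op up to maxAttempts times, sleeping between failed attempts and doubling the delay.
    // Rethrows the last failure once the attempts are exhausted.
    static <T> T withRetries(Callable<T> op, int maxAttempts, long initialDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e;  // give up and surface the failure to the caller
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }
}
```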

This message was sent by Atlassian JIRA
