hadoop-common-issues mailing list archives

From "Thomas (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-14535) Support for random access and seek of block blobs
Date Fri, 30 Jun 2017 23:23:00 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas updated HADOOP-14535:
----------------------------
    Attachment: 0005-Random-access-and-seek-imporvements-to-azure-file-system.patch

I am attaching the updated patch (0005-Random-access-and-seek-improvements-to-azure-file-system.patch).
Random access is as much as 90% faster for block blobs *without* any regressions.
TestBlockBlobInputStream.java contains unit tests demonstrating the performance improvement
for random access, as well as unit tests demonstrating that there are no performance
regressions in sequential reads after reverse seeks.

However, please note that unit tests and developer machines are not an appropriate
environment for measuring performance.  The performance tests in TestBlockBlobInputStream.java
merely demonstrate the behavior and prevent regressions.  Many things can impact performance
measurements over short periods of time, including but not limited to fluctuations in network
traffic and routing, fluctuations in the activity of other processes running on the client,
fluctuations in load on the shared stamp that hosts your Azure Storage account, and throttling
sometimes performed by enterprise IT departments.  The performance tests included with this
change are written to execute quickly, work around these fluctuations, and prevent regressions
in the code.  While implementing and running these unit tests, I also validated the performance
improvements by running variations of the code for longer periods, and the results looked
favorable.

My team plans to review and improve the instrumentation (Hadoop Metrics) for the wasb:// file
system.  Although this change does not include new metrics, we will be looking into this in
the future.

ALL tests in "hadoop-tools/hadoop-azure" are passing with the patch (0005-Random-access-and-seek-improvements-to-azure-file-system.patch).

> Support for random access and seek of block blobs
> -------------------------------------------------
>
>                 Key: HADOOP-14535
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14535
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/azure
>            Reporter: Thomas
>            Assignee: Thomas
>         Attachments: 0001-Random-access-and-seek-imporvements-to-azure-file-system.patch,
>                      0003-Random-access-and-seek-imporvements-to-azure-file-system.patch,
>                      0004-Random-access-and-seek-imporvements-to-azure-file-system.patch,
>                      0005-Random-access-and-seek-imporvements-to-azure-file-system.patch
>
>
> This change adds a seek-able stream for reading block blobs to the wasb:// file system.
> If seek() is not used or if only forward seek() is used, the behavior of read() is unchanged.
> That is, the stream is optimized for sequential reads by reading chunks (over the network)
> in the size specified by "fs.azure.read.request.size" (default is 4 megabytes).
> If reverse seek() is used, the behavior of read() changes in favor of reading the actual
> number of bytes requested in the call to read(), with some constraints.  If the size
> requested is smaller than 16 kilobytes and cannot be satisfied by the internal buffer,
> the network read will be 16 kilobytes.  If the size requested is greater than 4 megabytes,
> it will be satisfied by sequential 4 megabyte reads over the network.
> This change improves the performance of FSInputStream.seek() by not closing and re-opening
> the stream, which for block blobs also involves a network operation to read the blob
> metadata.  Now NativeAzureFsInputStream.seek() checks if the stream is seek-able and moves
> the read position.
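
The read-size rules in the description above can be modeled in a few lines of Java.  This
is only a sketch of the behavior as described, not the code in the attached patch.  The
class name, method names, and constants are hypothetical, the sketch ignores the end of the
blob, and it assumes the 16 kilobyte floor is applied to the requested size.

```java
/** Illustrative model only; not the code in the attached patch. */
final class BlockBlobReadSketch {
    // 16 KB floor for small random reads, as stated in the issue description.
    static final int MIN_RANDOM_READ = 16 * 1024;
    // Default of "fs.azure.read.request.size" (4 MB), as stated in the description.
    static final int MAX_READ = 4 * 1024 * 1024;

    private final long blobLength;
    private long position = 0;
    private boolean reverseSeekSeen = false;

    BlockBlobReadSketch(long blobLength) {
        this.blobLength = blobLength;
    }

    /** Moves the read position in place; nothing is closed or re-opened. */
    void seek(long newPosition) {
        if (newPosition < 0 || newPosition > blobLength) {
            throw new IllegalArgumentException("seek out of range: " + newPosition);
        }
        if (newPosition < position) {
            reverseSeekSeen = true; // switch to random-access read sizing
        }
        position = newPosition;
    }

    long position() {
        return position;
    }

    boolean reverseSeekSeen() {
        return reverseSeekSeen;
    }

    /**
     * Size of the next network read, or 0 if the internal buffer already
     * holds enough bytes to satisfy the request.
     */
    static int nextNetworkReadSize(int requested, int buffered, boolean reverseSeekSeen) {
        if (requested <= buffered) {
            return 0; // served from the internal buffer
        }
        if (!reverseSeekSeen) {
            return MAX_READ; // sequential mode: always read a full chunk
        }
        // Random-access mode: read what was asked for, clamped to [16 KB, 4 MB].
        // Requests above 4 MB are served by repeated 4 MB reads.
        return Math.max(MIN_RANDOM_READ, Math.min(requested, MAX_READ));
    }

    public static void main(String[] args) {
        BlockBlobReadSketch stream = new BlockBlobReadSketch(64L * 1024 * 1024);
        stream.seek(10_000_000); // forward seek: sizing is unchanged
        System.out.println(nextNetworkReadSize(100, 0, stream.reverseSeekSeen()));
        stream.seek(1_000);      // reverse seek: sizing follows the request
        System.out.println(nextNetworkReadSize(100, 0, stream.reverseSeekSeen()));
    }
}
```

In this model a forward-only reader keeps issuing full 4 megabyte reads, while a reader
that has seeked backwards pays for at most the bytes it asks for, subject to the 16 kilobyte
floor.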



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


