hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14535) Support for random access and seek of block blobs
Date Mon, 10 Jul 2017 17:12:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080668#comment-16080668
] 

Steve Loughran commented on HADOOP-14535:
-----------------------------------------

This is getting pretty close to going in; most of my feedback is test related. 1+ iteration
should be enough. Once this and HADOOP-14598 are in, I can do some downstream testing with
real data.

h3. {{AzureNativeFileSystemStore}}

* How about naming the key of {{KEY_INPUT_STREAM_VERSION}} to "fs.azure.experimental.stream.version"?
That's be consistent with the "fs.s3a.experimental" term? 
* log @ debug choice of algorithm, to aid diagnostics
* {{retrieve()}} L2066: the {{PageBlobInputStream}} constructor already wraps StorageException
with IOE. Retrieve doesn't need to catch and translate them, so should catch and then rethrow
IOEs as is.

h3. {{BlockBlobInputStream}}

* a seek to the current position can be downgraded to a no-op; no need to close & reopen
the stream
* you don't need to go {{this.}} when referencing fields. We expect our IDEs to colour code
fields these days.
* can you have the {{else}} and the {{catch}} statements on the same line as the previous
clauses closing "}".
* {{read(byte[] buffer, ..)}}. Use {{FSInputStream.validatePositionedReadArgs}} for validation,
or at least as much of it as is relevant. FWIW, the order of checks matches that in InputStream.
* {{closeBlobInputStream}}: should {{blobInputStream=null}} be done in a {{finally}} clause
so that it is guaranteed to be set (so making the call non-reentrant)

h3. {{NativeAzureFileSystem}}

* L625: accidental typo in comment

h3. {{ContractTestUtils.java}}

revert move of {{elapsedTime()}} to a single line method, use multiline style for the new
{{elapsedTimeMs()}}. 


h3. {{TestBlockBlobInputStream}}

# I like the idea of using the ratio as a way of comparing performance; it makes it independent
of bandwidth.
# And I agree, you can't reliably assess real-world perf. But it would seem faster.
# Once HADOOP-14553 is in, this test would be uprated to a scale test; only executed with
the -Dscale option, 
and configurable for larger sizes of data. No need to worry about it. I think the tests could
perhaps even be moved into the [ITestAzureHugeFiles|https://github.com/steveloughran/hadoop/blob/azure/HADOOP-14553-testing/hadoop-tools/hadoop-azure/src/test/java/org/apache/hadoop/fs/azure/integration/ITestAzureHugeFiles.java]
test, which forces a strict ordering of tests in junit, so can have one test to upload a file,
one to delete, and some in between to play with reading and seeking.

for now

* {{TestBlockBlobInputStream}} to extend {{AbstractWasbTestBase}}. This will aid migration
to the parallel test runner of HADOOP-14553
* {{TestBlockBlobInputStream}} teardown only closes one of the input streams.
* {{toMbps()}}:  would it be better or worse to do the *8 before the / 1000.0? Or, given these
are floating point, moot?
* split {{testMarkSupported()}} into a separate test for each separate stream; assertion in
{{validateMarkSupported}} to include some text.
* same for {{testSkipAndAvailableAndPosition}}
* {{testSequentialReadPerformance}} are we confident that the {{v2ElapsedMs}} read time will
always be >0? Otherewise that division will fail.
* {{testRandomRead}} and {{testSequentialRead}} to always close the output stream. Or save
a refernce to the stream into a field and have the @After teardown close it (quietly)
* {{validateMarkAndReset, validateSkipBounds}} to use {{GenericTestUtils.assertExceptionContains}}
to validate caught exception, or
 {{LambdaTestUtils.intercept}} to structure expected failure. Have a look at other uses in
the code for details. +Same in other tests.

{code}
    try {
      seekCheck(in, dataLen + 3);
      Assert.fail("Seek after EOF should fail.");
    } catch (IOException e) {
      GenericTestUtils.assertExceptionContains("Cannot seek after EOF", e);
    }
{code}

LambdaTestUtils may seem a bit more convoluted

{code}    
    intercept(IOException.class, expected,
        new Callable<S3AEncryptionMethods>() {
          @Override
          public S3AEncryptionMethods call() throws Exception {
            return getAlgorithm(alg, key);
          }
        });
{code}    

But it really comes out to play in Java 8:

{code}    
    intercept(IOException.class, expected,
        () -> getAlgorithm(alg, key));
{code}    


That's why I'd recommend adopting it now.





Other

h3. {{AzureBlobStorageTestAccount}}

* L96; think some tabs have snuck in.
* I have problem in that every test run leaks wasb containers. Does this patch continue or
even worsen the tradition?



> Support for random access and seek of block blobs
> -------------------------------------------------
>
>                 Key: HADOOP-14535
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14535
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/azure
>            Reporter: Thomas
>            Assignee: Thomas
>         Attachments: 0001-Random-access-and-seek-imporvements-to-azure-file-system.patch,
0003-Random-access-and-seek-imporvements-to-azure-file-system.patch, 0004-Random-access-and-seek-imporvements-to-azure-file-system.patch,
0005-Random-access-and-seek-imporvements-to-azure-file-system.patch
>
>
> This change adds a seek-able stream for reading block blobs to the wasb:// file system.
> If seek() is not used or if only forward seek() is used, the behavior of read() is unchanged.
> That is, the stream is optimized for sequential reads by reading chunks (over the network)
in
> the size specified by "fs.azure.read.request.size" (default is 4 megabytes).
> If reverse seek() is used, the behavior of read() changes in favor of reading the actual
number
> of bytes requested in the call to read(), with some constraints.  If the size requested
is smaller
> than 16 kilobytes and cannot be satisfied by the internal buffer, the network read will
be 16
> kilobytes.  If the size requested is greater than 4 megabytes, it will be satisfied by
sequential
> 4 megabyte reads over the network.
> This change improves the performance of FSInputStream.seek() by not closing and re-opening
the
> stream, which for block blobs also involves a network operation to read the blob metadata.
Now
> NativeAzureFsInputStream.seek() checks if the stream is seek-able and moves the read
position.
> [^attachment-name.zip]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org


Mime
View raw message