hadoop-common-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13560) S3ABlockOutputStream to support huge (many GB) file writes
Date Wed, 28 Sep 2016 18:51:20 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15530520#comment-15530520 ]

Chris Nauroth commented on HADOOP-13560:
----------------------------------------

At this point, I've read through all of patch 006 except for the documentation updates.  Very
nice patch!  Here is more feedback on the code, and I'll start reviewing the documentation
changes next.

I see the checks in {{S3ADataBlocks#validateWriteArgs}} are taken from {{java.io.OutputStream#write}}.
 Would you please add a note in the JavaDoc that this is implemented specifically to match
that contract?
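
For reference, a rough sketch of what those contract checks look like (illustrative only, not the patch code):
{code}
// Illustrative sketch: these checks mirror the contract documented on
// java.io.OutputStream#write(byte[], int, int).
static void validateWriteArgs(byte[] b, int off, int len) {
  if (b == null) {
    throw new NullPointerException();
  }
  if (off < 0 || off > b.length || len < 0
      || (off + len) > b.length || (off + len) < 0) {
    throw new IndexOutOfBoundsException(
        "write (b[" + b.length + "], " + off + ", " + len + ")");
  }
}
{code}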

It seems the validation check in {{DataBlock#verifyState}} would only ever throw an exception
if there was a bug in our S3A code's state transition logic, not a recoverable situation for
the client.  If so, then is a runtime exception more appropriate, such as {{IllegalStateException}}?
 If so, then that would also avoid the ignored exception in {{DataBlock#enterClosedState}}.
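
Something like this, as a sketch (I'm assuming the state is held in an enum-valued {{state}} field; {{DestState}} is just an illustrative name):
{code}
// Sketch only: a failed check here means a bug in S3A's own state
// transition logic, so fail fast with an unchecked exception.
private void verifyState(DestState expected) {
  if (state != expected) {
    throw new IllegalStateException(
        "Expected block state " + expected + " but found " + state);
  }
}
{code}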

The JavaDoc for {{DataBlock#flush}} states that it is only valid in writing state.  Does it
make sense to move the validation check for that from {{DiskBlock#flush}} up into the base
class?
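
For example, something along these lines (sketch only; {{innerFlush}} is a made-up name for the subclass hook):
{code}
// Sketch: the base class performs the state check once, and subclasses
// only implement the buffer-specific flushing.
void flush() throws IOException {
  verifyState(DestState.Writing);   // DestState.Writing is illustrative
  innerFlush();                     // hypothetical hook for subclasses
}
{code}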

In {{ByteArrayBlock#startUpload}}, can the call to {{buffer.reset()}} be removed?  Resetting
sets an internal counter without releasing any of the underlying heap space, which is what
the {{buffer = null}} immediately afterwards achieves.
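
To illustrate the distinction (standalone example, not patch code):
{code}
// reset() only rewinds the internal count; the backing byte[] stays
// allocated at its current size. Dropping the reference is what makes
// the heap space eligible for collection.
ByteArrayOutputStream buffer = new ByteArrayOutputStream(1024 * 1024);
buffer.reset();   // the 1 MB backing array is still held
buffer = null;    // now the backing array can be garbage collected
{code}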

In the single-byte {{ByteBufferInputStream#read}}, I think checking {{hasRemaining()}} would
be more readable than handling {{BufferUnderflowException}}, and it would look consistent
with similar logic in the multi-byte {{read}}.
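
Something like this (sketch, assuming the wrapped {{ByteBuffer}} field is named {{buffer}}):
{code}
// Sketch: mirror the multi-byte read() by checking hasRemaining()
// instead of catching BufferUnderflowException.
@Override
public synchronized int read() {
  return buffer.hasRemaining() ? (buffer.get() & 0xFF) : -1;
}
{code}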

Please mark {{final}} on the member fields of {{ByteBufferBlock}} and {{DiskBlock}} where
appropriate.

{code}
      try {
        out.flush();
        out.close();
      } finally {
        out.close();
        out = null;
      }
{code}

I was a bit confused by the above code in {{DiskBlock#startUpload}}.  I think what this is
trying to do (and please correct me if I'm wrong) is to make sure we propagate an exception
thrown from {{flush}}.  Therefore, we can't simplify this code to just a single {{close}}
call, because even though {{BufferedOutputStream#close}} automatically flushes, it ignores
exceptions from flush, and we'd lose those error details.  If this is the intent, would you
please add a comment explaining it?
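
For example, a comment along these lines would capture it (assuming that really is the intent):
{code}
      try {
        // Flush explicitly so that an IOException raised while flushing
        // buffered data reaches the caller; BufferedOutputStream#close
        // would swallow a failure from its implicit flush.
        out.flush();
        out.close();
      } finally {
        out.close();
        out = null;
      }
{code}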

Should {{DiskBlock#startUpload}} wrap the returned stream in a {{BufferedInputStream}}?

Is {{ForwardingInputStream}} necessary?  It looks like {{FilterInputStream}}, so can {{FileDeletingInputStream}}
subclass that instead?
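
Roughly like this, as a sketch (the constructor shape and the {{LOG}} field are assumptions on my part):
{code}
// Sketch: FileDeletingInputStream as a FilterInputStream that removes
// the temporary buffer file once the stream is closed.
private static final class FileDeletingInputStream extends FilterInputStream {
  private final File file;

  FileDeletingInputStream(InputStream in, File file) {
    super(in);
    this.file = file;
  }

  @Override
  public void close() throws IOException {
    try {
      super.close();
    } finally {
      if (!file.delete()) {
        LOG.warn("Failed to delete buffer file {}", file);  // LOG assumed
      }
    }
  }
}
{code}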

In {{MultiPartUpload#shouldRetry}}, is {{Thread.interrupted()}} supposed to be {{Thread.currentThread().interrupt()}}?
 If {{InterruptedException}} has been thrown, then the interrupted flag has been cleared already,
so I think {{Thread.interrupted()}} would effectively be a no-op.
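
To illustrate (sketch only; the method shape here is made up, not the patch code):
{code}
// Sketch: once InterruptedException has been caught, the interrupt flag
// is already cleared, so Thread.interrupted() returns false and changes
// nothing; re-assert the status on the current thread instead.
private boolean sleepBeforeRetry(long delayMillis) {
  try {
    Thread.sleep(delayMillis);
    return true;
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt();   // not Thread.interrupted()
    return false;
  }
}
{code}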

Can we avoid creating the {{LocalDirAllocator}} when we are not using {{S3AOutputStream}}
or {{DiskBlockFactory}}?
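
For example, it could be created lazily (sketch only; I'm assuming the existing {{directoryAllocator}} field and the {{BUFFER_DIR}} key):
{code}
// Sketch: only construct the allocator the first time a disk-backed
// block factory or S3AOutputStream actually asks for local buffer space.
private synchronized LocalDirAllocator getDirectoryAllocator() {
  if (directoryAllocator == null) {
    directoryAllocator = new LocalDirAllocator(BUFFER_DIR);
  }
  return directoryAllocator;
}
{code}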

I'm unclear why {{MAX_THREADS}} was downtuned to 1 in {{ITestS3ABlockingThreadPool}}.  If
that's intentional, please also update the class-level comment that states 2 threads and describe
why the test is still sufficient with only 1 thread.


> S3ABlockOutputStream to support huge (many GB) file writes
> ----------------------------------------------------------
>
>                 Key: HADOOP-13560
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13560
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.9.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>         Attachments: HADOOP-13560-branch-2-001.patch, HADOOP-13560-branch-2-002.patch, HADOOP-13560-branch-2-003.patch, HADOOP-13560-branch-2-004.patch
>
>
> An AWS SDK [issue|https://github.com/aws/aws-sdk-java/issues/367] highlights that metadata isn't copied on large copies.
> 1. Add a test to do that large copy/rename and verify that the copy really works
> 2. Verify that metadata makes it over.
> Verifying large file rename is important on its own, as it is needed for very large commit operations for committers using rename.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

