mxnet-dev mailing list archives

From Bhavin Thaker <bhavintha...@gmail.com>
Subject Re: S3 Writes using SIG4 Authentication
Date Wed, 07 Mar 2018 17:56:26 GMT
Multi-part upload with finalization seems like a good approach for this
problem.

Bhavin Thaker.

On Wed, Mar 7, 2018 at 7:45 AM Naveen Swamy <mnnaveen@gmail.com> wrote:

> Rahul,
> IMO it is not OK to write to a local file before streaming. You have to
> consider security implications such as:
> 1) Will your local file be encrypted (encryption at rest)?
> 2) What happens if the process crashes? You will have to make sure the
> local file is deleted in failure and process-exit scenarios.
>
> My understanding is that multipart uploads use chunked transfer encoding,
> and for that you do not need to know the total size, only the size of each
> chunk of data.
> https://en.wikipedia.org/wiki/Chunked_transfer_encoding
>
> See this SO answer:
>
> https://stackoverflow.com/questions/8653146/can-i-stream-a-file-upload-to-s3-without-a-content-length-header
>
> Can you point to the literature that says the total size must be known?
>
> -Naveen
>
>
> On Tue, Mar 6, 2018 at 10:34 PM, Rahul Huilgol <rahulhuilgol@gmail.com>
> wrote:
>
> > Hi Chris,
> >
> > S3 doesn't support append calls. They promote the use of multipart uploads
> > to upload large files in parallel, or when network reliability is an issue.
> > Writing like a stream does not seem to be the purpose of multipart uploads.
> >
> > I looked into what the AWS SDK does (in Java). It buffers in memory however
> > large the file might be, and then uploads. I imagine this involves
> > reallocating and copying the buffer to a larger buffer. There are a few
> > issues raised regarding this on the SDK repos, like this
> > <https://github.com/aws/aws-sdk-java/issues/474>. But this doesn't seem to
> > be something the SDKs can do anything about. People seem to be writing to
> > temporary files and then uploading.
> >
> > Regards,
> > Rahul
> >
> > On Tue, Mar 6, 2018 at 9:04 PM, Chris Olivier <cjolivier01@gmail.com>
> > wrote:
> >
> > > it seems strange that s3 would make such a major restriction. there’s
> > > literally no way to incrementally write a file without knowing the size
> > > beforehand? some sort of separate append calls, maybe?
> > >
> > > On Tue, Mar 6, 2018 at 8:53 PM Rahul Huilgol <rahulhuilgol@gmail.com>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I have been looking at updating the authentication used by S3FileSystem
> > > > in dmlc-core. The current code uses Signature Version 2, which now works
> > > > only in the us-east-1 region. We need to update the authentication scheme
> > > > to use Signature Version 4 (SIG4).
> > > >
> > > > I've submitted a PR <https://github.com/dmlc/dmlc-core/pull/378> to
> > > > change this for Reads. But I wanted to seek out thoughts on what to do
> > > > for Writes, as there is a potential problem.
> > > >
> > > > *How writes to S3 work currently:*
> > > > Whenever s3filesystem's stream.write() is called, data is buffered. When
> > > > the buffer is full, a request is made to S3. Since this can happen
> > > > multiple times, the multipart upload feature is used. An upload ID is
> > > > created when the stream is initialized and is used until the stream is
> > > > closed. The default buffer size is 64MB.
> > > >
> > > > *Problem:*
> > > > The new SIG4 authentication scheme changes how multipart uploads work.
> > > > Such an upload now requires that we know the total size of the data to be
> > > > sent (the sum of the sizes of all parts) when we create the first request
> > > > itself. We need to pass the total size of the payload as part of the
> > > > header. This is not possible given that we don't know all the write calls
> > > > beforehand. For example, a call to save a model's parameters makes 145
> > > > calls to the stream's write.
> > > >
> > > > *Approach?*
> > > > Is it okay to buffer it to a local file, and then upload this file to S3
> > > > at the end?
> > > > What use case do we have for writes to S3 generally? I believe we would
> > > > want to write params after training or logs. These wouldn't be too large
> > > > or frequent I imagine. What would you suggest?
> > > >
> > > > Appreciate your thoughts and suggestions.
> > > >
> > > > Thanks,
> > > > Rahul Huilgol
> > > >
> > >
> >
> >
> >
> > --
> > Rahul Huilgol
> >
>
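
The buffered multipart flow discussed in this thread (buffer each write, send
a part when the buffer fills, finalize when the stream closes) can be sketched
in Python roughly as below. This is only an illustration under stated
assumptions, not the dmlc-core implementation: InMemoryMultipartStore and
BufferedMultipartWriter are hypothetical names, the store is an in-memory
stand-in rather than the real S3 API, and the part size is shrunk to 8 bytes
for the demo (the thread mentions a 64MB default). It assumes, as Naveen
suggests, that each part request only needs the size of that part, so the
total size is never known up front.

```python
class InMemoryMultipartStore:
    """Hypothetical in-memory stand-in for a multipart-capable object store."""

    def __init__(self):
        self.pending = {}   # upload_id -> {part_number: bytes}
        self.objects = {}   # key -> finished object bytes
        self._next_id = 0

    def create_upload(self, key):
        self._next_id += 1
        upload_id = f"upload-{self._next_id}"
        self.pending[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        # Only the size of this one part is known here.
        self.pending[upload_id][part_number] = bytes(data)

    def complete_upload(self, key, upload_id):
        # Finalization: stitch the parts together in part-number order.
        parts = self.pending.pop(upload_id)
        self.objects[key] = b"".join(parts[n] for n in sorted(parts))

    def abort_upload(self, upload_id):
        self.pending.pop(upload_id, None)


class BufferedMultipartWriter:
    """Buffers writes and emits one part each time the buffer fills."""

    def __init__(self, store, key, part_size=64 * 1024 * 1024):
        self.store = store
        self.key = key
        self.part_size = part_size
        self.buffer = bytearray()
        self.part_number = 0
        # One upload id is created up front and reused until close().
        self.upload_id = store.create_upload(key)

    def write(self, data):
        self.buffer.extend(data)
        while len(self.buffer) >= self.part_size:
            self._flush(self.part_size)

    def _flush(self, n):
        self.part_number += 1
        self.store.upload_part(self.upload_id, self.part_number, self.buffer[:n])
        del self.buffer[:n]

    def close(self):
        # Send whatever is left as the last part, then finalize.
        if self.buffer or self.part_number == 0:
            self._flush(len(self.buffer))
        self.store.complete_upload(self.key, self.upload_id)

    def abort(self):
        # On failure, discard the parts instead of finalizing.
        self.store.abort_upload(self.upload_id)


store = InMemoryMultipartStore()
writer = BufferedMultipartWriter(store, "model.params", part_size=8)
for chunk in (b"abc", b"defghij", b"klmnopqrstu"):
    writer.write(chunk)
writer.close()
```

Nothing is written to local disk in this sketch, and abort() gives a crashing
writer a way to drop an unfinished upload, which addresses the two concerns
raised about temporary files. Whether the real service accepts parts signed
this way under SIG4 is the open question in the thread and is not settled by
this example.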
