apex-dev mailing list archives

From Chaitanya Chebolu <chaita...@datatorrent.com>
Subject Re: S3 Output Module
Date Thu, 20 Oct 2016 14:41:55 GMT
Hi All,

I am proposing the new design below for the S3 Output Module, using the S3
multipart upload feature:

Input to this Module: FileMetadata, FileBlockMetadata, ReaderRecord

Steps for uploading files using the S3 multipart feature:
=============================

   1. Initiate the upload. S3 returns an upload id.
      Mandatory: bucket name, file path
      Note: The upload id is the unique identifier for the multipart upload
      of a file.

   2. Upload each block using the received upload id. S3 returns an ETag in
      the response to each upload.
      Mandatory: block number, upload id

   3. Send the merge request, providing the upload id and the list of ETags.
      Mandatory: upload id, file path, block ETags
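
To make the three steps above concrete, here is a minimal in-memory sketch in
Java. It does not call the AWS SDK: FakeMultipartStore and its methods are
hypothetical names made up for illustration, and the "ETag" is just a hash of
the block contents. It only models the order of calls and which inputs each
call needs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

/** In-memory stand-in for the three multipart calls. This is NOT the AWS SDK. */
final class FakeMultipartStore {
  // uploadId -> (block number -> block bytes), sorted by block number
  private final Map<String, TreeMap<Integer, byte[]>> pending = new HashMap<>();
  // uploadId -> "bucket/filePath" of the object being assembled
  private final Map<String, String> targets = new HashMap<>();
  // "bucket/filePath" -> merged object bytes
  private final Map<String, byte[]> objects = new HashMap<>();

  /** Step 1: initiate the upload and return an upload id. */
  String initiate(String bucket, String filePath) {
    String uploadId = UUID.randomUUID().toString();
    pending.put(uploadId, new TreeMap<>());
    targets.put(uploadId, bucket + "/" + filePath);
    return uploadId;
  }

  /** Step 2: upload one block; the returned string plays the role of the ETag. */
  String uploadPart(String uploadId, int blockNumber, byte[] data) {
    TreeMap<Integer, byte[]> parts = pending.get(uploadId);
    if (parts == null) {
      throw new IllegalArgumentException("unknown upload id");
    }
    parts.put(blockNumber, data.clone());
    return etagFor(blockNumber, data);
  }

  /** Step 3: merge the blocks, after checking the ETags (ordered by block number). */
  void complete(String uploadId, List<String> etags) {
    TreeMap<Integer, byte[]> parts = pending.get(uploadId);
    if (parts == null) {
      throw new IllegalArgumentException("unknown upload id");
    }
    List<String> expected = new ArrayList<>();
    int total = 0;
    for (Map.Entry<Integer, byte[]> e : parts.entrySet()) {
      expected.add(etagFor(e.getKey(), e.getValue()));
      total += e.getValue().length;
    }
    if (!expected.equals(etags)) {
      throw new IllegalStateException("ETag list does not match uploaded blocks");
    }
    byte[] merged = new byte[total];
    int offset = 0;
    for (byte[] part : parts.values()) {
      System.arraycopy(part, 0, merged, offset, part.length);
      offset += part.length;
    }
    objects.put(targets.remove(uploadId), merged);
    pending.remove(uploadId);
  }

  /** Returns the merged object, or null if the upload was never completed. */
  byte[] get(String bucket, String filePath) {
    return objects.get(bucket + "/" + filePath);
  }

  // Hypothetical tag format; a real ETag is computed by S3.
  private static String etagFor(int blockNumber, byte[] data) {
    return blockNumber + "-" + Integer.toHexString(Arrays.hashCode(data));
  }
}
```

Blocks may be uploaded in any order; the object only becomes visible through
get() after complete() succeeds, which is the property the module relies on.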

An example of uploading a file using the multipart feature is available here:
http://docs.aws.amazon.com/AmazonS3/latest/dev/llJavaUploadFile.html


I am proposing the two approaches below for the S3 Output Module.


(Solution 1)

The S3 Output Module consists of the two operators below:

1) BlockWriter: Writes the blocks into HDFS. Once a block is successfully
written to HDFS, this operator emits the BlockMetadata.

2) S3MultiPartUpload: This consists of two parts:

     a) If a file has more than one block, upload the blocks using the
multipart feature. Otherwise, upload the single block using putObject().

     b) Once all the blocks are successfully uploaded, send the merge
request.
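
The choice in 2(a) can be sketched as a small helper. UploadStrategy, Mode,
blockCount and choose are hypothetical names for illustration only; they are
not part of the proposed operators or of the AWS SDK.

```java
/** Sketch of the rule in 2(a): more than one block means multipart upload. */
final class UploadStrategy {
  enum Mode { PUT_OBJECT, MULTIPART }

  /** Number of blocks a file of the given length splits into (ceiling division). */
  static long blockCount(long fileLength, long blockSize) {
    if (blockSize <= 0) {
      throw new IllegalArgumentException("blockSize must be positive");
    }
    return (fileLength + blockSize - 1) / blockSize;
  }

  /** A file with zero or one block goes through putObject(); otherwise multipart. */
  static Mode choose(long fileLength, long blockSize) {
    return blockCount(fileLength, blockSize) > 1 ? Mode.MULTIPART : Mode.PUT_OBJECT;
  }
}
```

For example, with a 128-byte block size a 100-byte file maps to PUT_OBJECT and
a 129-byte file maps to MULTIPART.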


(Solution 2)

The DAG for this solution is as follows:

1) InitiateS3Upload:

Input: FileMetadata

Initiates the upload. This operator emits (filemetadata, uploadId) to
S3FileMerger and (filePath, uploadId) to S3BlockUpload.

2) S3BlockUpload:

Input: FileBlockMetadata, ReaderRecord

Uploads the blocks into S3. S3 returns an ETag for each upload.
S3BlockUpload emits (path, ETag) to S3FileMerger.

3) S3FileMerger: Sends the file merge request to S3.
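
One way to picture the bookkeeping S3FileMerger needs is sketched below. It is
an assumption-laden illustration, not the proposed implementation: MergeTracker
and its methods are hypothetical, and it assumes each ETag tuple also carries
the block number and that the file metadata carries the block count, since the
merge request needs the ETags in block order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Collects ETags per file and reports when the merge request can be sent. */
final class MergeTracker {
  // path -> number of blocks expected for that file
  private final Map<String, Integer> expected = new HashMap<>();
  // path -> (block number -> ETag), sorted by block number
  private final Map<String, TreeMap<Integer, String>> etags = new HashMap<>();

  /** Called when the file metadata (with its block count) arrives. */
  Optional<List<String>> onFileMetadata(String path, int blockCount) {
    expected.put(path, blockCount);
    return readyEtags(path);
  }

  /** Called for every uploaded block; tuples may arrive in any order. */
  Optional<List<String>> onBlockUploaded(String path, int blockNumber, String etag) {
    etags.computeIfAbsent(path, p -> new TreeMap<>()).put(blockNumber, etag);
    return readyEtags(path);
  }

  /** Returns the ordered ETag list once metadata and all block ETags are present. */
  private Optional<List<String>> readyEtags(String path) {
    Integer want = expected.get(path);
    TreeMap<Integer, String> got = etags.get(path);
    if (want == null || got == null || got.size() < want) {
      return Optional.empty();
    }
    expected.remove(path);
    etags.remove(path);
    return Optional.of(new ArrayList<>(got.values()));
  }
}
```

Because the two input streams are independent, the tracker accepts block
tuples before or after the metadata and fires only once both sides are
complete. Recovery of this state after an operator failure is left out of the
sketch.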

Pros:

(1) Supports files of up to 5 TB.

(2) Reduces the end-to-end latency, because we do not wait for all the
blocks of a file to be written to HDFS before uploading.
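
As a rough check on point (1): to my knowledge, S3 multipart upload allows at
most 10,000 parts per upload, with each part between 5 MB and 5 GB (the last
part may be smaller), so the block size bounds the largest file that can be
uploaded. The sketch below treats those limits as binary units and uses
hypothetical names (S3MultipartLimits, minPartSize, fits); please verify the
numbers against the current AWS documentation.

```java
/** Arithmetic sketch of the multipart limits; the constants are assumptions. */
final class S3MultipartLimits {
  static final long MIB = 1024L * 1024L;
  static final long MIN_PART = 5L * MIB;                    // every part except the last
  static final long MAX_PART = 5L * 1024L * MIB;            // 5 GiB
  static final long MAX_PARTS = 10_000L;                    // parts per upload
  static final long MAX_OBJECT = 5L * 1024L * 1024L * MIB;  // 5 TiB

  /** Smallest block size that keeps the file within MAX_PARTS parts. */
  static long minPartSize(long fileLength) {
    long needed = (fileLength + MAX_PARTS - 1) / MAX_PARTS; // ceiling division
    return Math.max(MIN_PART, needed);
  }

  /** True if a file split into blocks of blockSize can go through multipart. */
  static boolean fits(long fileLength, long blockSize) {
    return fileLength <= MAX_OBJECT
        && blockSize <= MAX_PART
        && blockSize >= minPartSize(fileLength);
  }
}
```

Under these assumptions a 128 MiB block size tops out at about 1.2 TiB per
file, and reaching the full 5 TiB needs blocks of roughly 525 MiB or more.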

Please vote and share your thoughts on these approaches.

Regards,
Chaitanya

On Tue, Mar 29, 2016 at 2:35 PM, Chaitanya Chebolu <
chaitanya@datatorrent.com> wrote:

> @ Tushar
>
>   S3 Copy Output Module consists of following operators:
> 1) BlockWriter : Writes the blocks into the HDFS.
> 2) Synchronizer: Sends a trigger to the downstream operator when all the
> blocks for a file are written to HDFS.
> 3) FileMerger: Merges all the blocks into a file and will upload the
> merged file into S3 bucket.
>
> @ Ashwin
>
>     Good suggestion. In the first iteration, I will add the proposed design.
> I will add multipart support in the next iteration.
>
> Regards,
> Chaitanya
>
> On Thu, Mar 24, 2016 at 2:44 AM, Ashwin Chandra Putta <
> ashwinchandrap@gmail.com> wrote:
>
>> +1 regarding the s3 upload functionality.
>>
>> However, I think we should just focus on multipart upload directly as it
>> comes with various advantages like higher throughput, faster recovery, not
>> needing to wait for entire file being created before uploading each part.
>> See: http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
>>
>> Also, seems like we can do multipart upload if the file size is more than
>> 5MB. They do recommend using multipart if the file size is more than
>> 100MB.
>> I am not sure if there is a hard lower limit though. See:
>> http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
>>
>> This way, it seems like we don't have to wait until a file is completely
>> written to hdfs before performing the upload operation.
>>
>> Regards,
>> Ashwin.
>>
>> On Wed, Mar 23, 2016 at 5:10 AM, Tushar Gosavi <tushar@datatorrent.com>
>> wrote:
>>
>> > +1 , we need this functionality.
>> >
>> > Is it going to be a single operator or multiple operators? If multiple
>> > operators, then can you explain what functionality each operator will
>> > provide?
>> >
>> >
>> > Regards,
>> > -Tushar.
>> >
>> >
>> > On Wed, Mar 23, 2016 at 5:01 PM, Yogi Devendra <yogidevendra@apache.org>
>> > wrote:
>> >
>> > > Writing to S3 is a common use-case for applications.
>> > > This module will be definitely helpful.
>> > >
>> > > +1 for adding this module.
>> > >
>> > >
>> > > ~ Yogi
>> > >
>> > > On 22 March 2016 at 13:52, Chaitanya Chebolu <chaitanya@datatorrent.com>
>> > > wrote:
>> > >
>> > > > Hi All,
>> > > >
>> > > >   I am proposing an S3 output copy Module. The primary functionality of
>> > > > this module is uploading files to an S3 bucket using a block-by-block
>> > > > approach.
>> > > >
>> > > >   Below is the JIRA created for this task:
>> > > > https://issues.apache.org/jira/browse/APEXMALHAR-2022
>> > > >
>> > > >   The design of this module is similar to the HDFS copy module, so I will
>> > > > extend the HDFS copy module for S3.
>> > > >
>> > > > Design of this Module:
>> > > > =======================
>> > > > 1) Write the blocks into HDFS.
>> > > > 2) Merge the blocks into a file.
>> > > > 3) Upload the merged file into the S3 bucket using the AmazonS3Client
>> > > > APIs.
>> > > >
>> > > > Steps (1) & (2) are the same as in the HDFS copy module.
>> > > >
>> > > > *Limitation:* Supports files of up to 5 GB. Please refer to the link
>> > > > below about the limitations of uploading objects into S3:
>> > > > http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
>> > > >
>> > > > We can resolve the above limitation by using the S3 Multipart feature. I
>> > > > will add multipart support in the next iteration.
>> > > >
>> > > >  Please share your thoughts on this.
>> > > >
>> > > > Regards,
>> > > > Chaitanya
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>>
>> Regards,
>> Ashwin.
>>
>
>
