apex-dev mailing list archives

From AJAY GUPTA <ajaygit...@gmail.com>
Subject Re: [jira] [Assigned] (APEXMALHAR-2303) S3 Line By Line Module
Date Wed, 19 Oct 2016 13:13:01 GMT
Hi

I need suggestions from the Apex dev community on the following.

For the S3RecordReader approach mentioned in the previous mail, I am facing
an issue with determining the end of file.
Note that the input to this operator will not contain the file size.

The following approaches are possible.

1) The S3 getObject() call, which fetches file data within a range, will
throw an AmazonS3Exception if the range provided is out of bounds. Hence,
if the file size is 10 bytes and I make a getObject request for bytes 11 to
15, I will get this exception:
Exception in thread "main"
com.amazonaws.services.s3.model.AmazonS3Exception: The requested range is
not satisfiable (Service: Amazon S3; Status Code: 416; Error Code:
InvalidRange; Request ID:
If this exception gets thrown, I can catch it in the code and conclude that
the end of file has been reached.
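
A minimal sketch of this approach, with a hypothetical in-memory stand-in
for the S3 client (all class and method names here are made up for
illustration; in the real operator the thrown type would be
AmazonS3Exception with HTTP status 416):

```java
import java.util.Arrays;

// Hypothetical in-memory stand-in for the S3 client: a ranged read starting
// past the end of the data throws, just as getObject() with an out-of-bounds
// range throws AmazonS3Exception (status 416, InvalidRange). A range that
// merely *ends* past EOF is truncated instead of failing, matching S3.
class RangeStore {
  private final byte[] data;

  RangeStore(byte[] data) {
    this.data = data;
  }

  // start and end are inclusive, mirroring a ranged GetObjectRequest.
  byte[] getRange(long start, long end) {
    if (start >= data.length) {
      // stands in for AmazonS3Exception with HTTP status 416
      throw new IllegalStateException("416 InvalidRange");
    }
    return Arrays.copyOfRange(data, (int) start, (int) Math.min(end + 1, data.length));
  }
}

public class EofByException {
  // Approach 1: keep issuing fixed-size range requests and treat the
  // out-of-range error as the end-of-file signal.
  static int readAll(RangeStore store, int blockSize) {
    int totalBytes = 0;
    long offset = 0;
    while (true) {
      try {
        byte[] chunk = store.getRange(offset, offset + blockSize - 1);
        totalBytes += chunk.length;
        offset += chunk.length;
      } catch (IllegalStateException e) { // catch AmazonS3Exception in the real operator
        break; // conclude that the end of file is reached
      }
    }
    return totalBytes;
  }

  public static void main(String[] args) {
    // A 10-byte object read in 4-byte ranges: 4 + 4 + 2 bytes, then the
    // out-of-range error ends the loop.
    System.out.println(readAll(new RangeStore(new byte[10]), 4));
  }
}
```

Note that only the final request per file fails, so this costs one extra
request per file rather than a separate metadata call.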

2) For every container running this application, maintain a map of
<filename, filesize>. If the file size already exists in this map, use it
from there. If not, fetch the file size from S3 and add it to the map.
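
A sketch of this second approach (hypothetical class names; the S3 metadata
lookup is injected as a function so it can be stubbed, where the real code
would use something along the lines of the object metadata's content
length):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Approach 2: each container keeps a map<filename, filesize> so that the
// S3 metadata lookup runs at most once per file per container.
public class FileSizeCache {
  private final Map<String, Long> sizes = new ConcurrentHashMap<>();
  private final Function<String, Long> fetchFromS3; // injected S3 lookup, stubbed in tests

  FileSizeCache(Function<String, Long> fetchFromS3) {
    this.fetchFromS3 = fetchFromS3;
  }

  long sizeOf(String fileName) {
    // computeIfAbsent goes to S3 only on a cache miss; later calls for the
    // same file are served from the map.
    return sizes.computeIfAbsent(fileName, fetchFromS3);
  }
}
```

The trade-off versus approach 1 is one extra metadata request per file per
container, against a simpler read loop that knows its bounds up front.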

My own opinion is to go with the first approach, since it will make fewer
calls to S3 for getting the file length.
Kindly suggest any other approaches you can think of.


Thanks,
Ajay



On Wed, Oct 19, 2016 at 11:53 AM, AJAY GUPTA <ajaygit158@gmail.com> wrote:

> Hi Apex Dev community,
>
> Kindly provide feedback, if any, on the following approach for
> implementing S3RecordReader.
>
> *S3RecordReader(delimited records)*
> *Input*: BlockMetaData containing the offset and length
> *Expected Output*: Records in the block
> *Approach*:
> Similar to the approach currently followed in FSRecordReader:
> 1) Fetch the block from S3. The S3 block fetch size should ideally be
> large enough, say 64 MB, to avoid unnecessary network delays.
> 2) Search for newline characters in the block and emit the records.
> 3) The last record in the current block might overflow into the
> subsequent block. For this, we will fetch a small part of the subsequent
> block, say 1 MB, search for a newline character, and emit the record once
> one is found. We will keep fetching additional 1 MB chunks till a newline
> character is found.
> 4) We will also avoid reading the first record of every block (except the
> first block), as this set of bytes is part of the last record of the
> previous block.
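
The quoted steps can be sketched as below (hypothetical names; an in-memory
byte[] stands in for ranged S3 reads, so step 3's incremental 1 MB fetches
are elided, and the inner scan would instead trigger extra range requests
whenever it runs past the bytes fetched so far). A record is emitted by the
block its first byte falls in, the same convention used by Hadoop's
LineRecordReader:

```java
import java.util.ArrayList;
import java.util.List;

// Emit the newline-delimited records belonging to one block of the file.
public class BlockRecordSketch {
  static List<String> recordsForBlock(byte[] file, int blockStart, int blockLen) {
    List<String> records = new ArrayList<>();
    int blockEnd = blockStart + blockLen;
    int pos = blockStart;
    if (blockStart != 0) {
      // Step 4: the leading bytes of a non-first block belong to the last
      // record of the previous block, so skip past the first delimiter.
      while (pos < file.length && file[pos] != '\n') {
        pos++;
      }
      pos++;
    }
    // Steps 2-3: emit every record that starts at or before the block
    // boundary; the final record may overflow into the next block. A record
    // starting exactly at the boundary is emitted here and skipped by the
    // next block's step 4, so each record is emitted exactly once.
    while (pos <= blockEnd && pos < file.length) {
      int recordEnd = pos;
      while (recordEnd < file.length && file[recordEnd] != '\n') {
        recordEnd++; // in the real operator: fetch another 1 MB if this runs past the fetched bytes
      }
      records.add(new String(file, pos, recordEnd - pos));
      pos = recordEnd + 1;
    }
    return records;
  }
}
```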
>
>
> Regards,
> Ajay
>
>
>
> On Wed, Oct 19, 2016 at 7:31 AM, Ajay Gupta (JIRA) <jira@apache.org>
> wrote:
>
>>
>>      [ https://issues.apache.org/jira/browse/APEXMALHAR-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>
>> Ajay Gupta reassigned APEXMALHAR-2303:
>> --------------------------------------
>>
>>     Assignee: Ajay Gupta
>>
>> > S3 Line By Line Module
>> > ----------------------
>> >
>> >                 Key: APEXMALHAR-2303
>> >                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2303
>> >             Project: Apache Apex Malhar
>> >          Issue Type: Bug
>> >            Reporter: Ajay Gupta
>> >            Assignee: Ajay Gupta
>> >   Original Estimate: 336h
>> >  Remaining Estimate: 336h
>> >
>> > This is a new module which will consist of 2 operators
>> > 1) File Splitter -- Already existing in Malhar library
>> > 2) S3RecordReader -- Read a file from S3 and output the records
>> (delimited or fixed width)
>>
>>
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.3.4#6332)
>>
>
>
