apex-users mailing list archives

From Chaitanya Chebolu <chaita...@datatorrent.com>
Subject Re: [Malhar] @since 3.5.0 - S3InputModule is broken or does not function as documented.
Date Thu, 24 Nov 2016 10:47:47 GMT
Hi David,

I would suggest using the s3n or s3a scheme with the S3InputModule. The
scheme is used only for scanning the directories/files through the Hadoop
file system.

I have used this module and it works fine. The S3InputModule unit test runs
only on a cluster because it requires the Hadoop libraries.

S3InputModule doesn't process the same file twice. It periodically scans the
specified directories/files for files that are newly added or modified, and
it marks each discovered file as processed.
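
The scan-and-mark behaviour described above can be sketched in plain Java.
This is only an illustration, not the Malhar implementation: the class name
ScanSketch, the scan() method, and the use of a path-to-modification-time
map as the "listing" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: remembers which files were already handed out and
// reports only files that are new or have a newer modification time.
class ScanSketch {
    // path -> modification time recorded when the file was marked processed
    private final Map<String, Long> processed = new HashMap<>();

    /** Returns new or modified paths from one listing and marks them processed. */
    List<String> scan(Map<String, Long> listing) {
        List<String> discovered = new ArrayList<>();
        for (Map.Entry<String, Long> entry : listing.entrySet()) {
            Long seen = processed.get(entry.getKey());
            if (seen == null || seen < entry.getValue()) {
                discovered.add(entry.getKey());
                processed.put(entry.getKey(), entry.getValue());
            }
        }
        return discovered;
    }
}
```

Calling scan() twice with the same listing returns the files once and then
an empty list; a file reappears only if its modification time increases.
Note that the map in this sketch lives only in memory, so the bookkeeping
is lost when the process exits.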

Please let me know if you are still facing the issue.


On Thu, Nov 24, 2016 at 9:26 AM, <dashirov@yahoo.com> wrote:

> Hello,
> When using s3:// schema on amazon managed s3 bucket, the module attempts
> to retrieve the prefix with a leading / character which amazonaws does not
> recognize. When under a debugger the leading "/" is removed the module
> proceeds forward, but errors out downstream expecting a leading "/".
> Additionally, if the secret key generated by AWS contains / character,
> authentication will break. If the / character is replaced with URI escape
> sequence %2F authentication will break. The only way to pass auth in my
> case was to keep regenerating the keys until the secret key produced was
> free of / or : characters. I'm pretty sure this is not going to cut it in
> production. Does anyone know what magic is called for to properly escape
> the offending characters from the s3(n) URI with the format as proposed by
> the module developers ?
> Has anyone had any success using S3InputModule? I haven't deployed the app
> to the EMR cluster yet, all tests are local accessing S3 buckets from
> outside of the amazon cloud. It seems the authors have set up their unit
> tests with s3n:// schema. Is there a way to replicate the original unit
> tests?
> Side note, current implementation does not persist the list of processed
> files anywhere outside of the running process. Nor does the logic allow for
> moving processed files into another bucket or marking them as complete.
> Does anyone know what was the original design thought, in terms of
> protection against duplicate processing?
> Any insights would be greatly appreciated!
> -- David.
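
The secret-key problem described in the quoted message can be illustrated
with java.net.URI alone. This is a hedged sketch: it assumes the
credentials are embedded as scheme://ACCESS_KEY:SECRET_KEY@bucket/dir and
that the string is parsed by generic URI rules; the class S3UriSketch, the
build() helper, and all key and bucket values are made up for the example.
How the module itself parses the URI is not shown in this thread.

```java
import java.net.URI;

// Hypothetical helper that embeds credentials in the URI authority.
class S3UriSketch {
    /** Concatenates the parts without any escaping, as a naive caller would. */
    static URI build(String scheme, String accessKey, String secretKey,
                     String bucket, String dir) {
        return URI.create(scheme + "://" + accessKey + ":" + secretKey
                + "@" + bucket + "/" + dir);
    }

    public static void main(String[] args) {
        URI clean = build("s3n", "AKIAEXAMPLE", "abcdef", "my-bucket", "input");
        URI slashed = build("s3n", "AKIAEXAMPLE", "abc/def", "my-bucket", "input");
        // Compare which part of each URI is treated as host and which as path.
        System.out.println(clean.getHost() + " | " + clean.getPath());
        System.out.println(slashed.getHost() + " | " + slashed.getPath());
    }
}
```

Under generic URI syntax the authority ends at the first "/", so a secret
key containing "/" is cut in half: the remainder of the key, the "@", and
the bucket name all end up in the path, and the bucket is no longer
recognised as the host.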
