apex-dev mailing list archives

From Chaitanya Chebolu <chaita...@datatorrent.com>
Subject Re: S3 Input Module
Date Mon, 21 Mar 2016 11:45:24 GMT
Hi Sandeep,

To configure the Input Module, "files" is the mandatory property.
Description: Comma-separated list of files/directories to copy.

Parallel read depends on the "readersCount" property, which sets the
number of block reader instances used to read the files. The default value
is 1.

For S3, the user has to specify the "files" property as a comma-separated
list of URLs of the form:
SCHEME://AccessKey:SecretKey@BucketName/FileOrDirectory ,
SCHEME://AccessKey:SecretKey@BucketName/FileOrDirectory , ....
This URL format is defined by the Hadoop library.
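
To make the format concrete, here is a minimal Java sketch that only builds
the string value for "files". The class and method names (S3FilesProperty,
entry, join) and the sample keys, bucket and paths are made up for
illustration; they are not part of the module or of Hadoop.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Apex Malhar or Hadoop.
class S3FilesProperty {
    // Builds one entry: SCHEME://AccessKey:SecretKey@BucketName/FileOrDirectory
    static String entry(String scheme, String accessKey, String secretKey,
                        String bucket, String path) {
        // Drop a leading slash so the result has exactly one "/" after the bucket.
        String relative = path.startsWith("/") ? path.substring(1) : path;
        return scheme + "://" + accessKey + ":" + secretKey + "@" + bucket + "/" + relative;
    }

    // Joins the entries into the comma-separated value of "files".
    static String join(List<String> entries) {
        return String.join(",", entries);
    }

    public static void main(String[] args) {
        // Placeholder credentials and paths, for illustration only.
        String files = join(Arrays.asList(
            entry("s3a", "ACCESS_KEY", "SECRET_KEY", "my-bucket", "logs/2016"),
            entry("s3a", "ACCESS_KEY", "SECRET_KEY", "my-bucket", "data/part-0001.csv")));
        System.out.println(files);
    }
}
```

The resulting string is what would be set as the value of "files".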

  The Hadoop library supports the following file systems for S3; the scheme
for each is given in parentheses:
1) S3 (s3)
2) NativeS3FileSystem (s3n)
3) S3AFileSystem (s3a)

  For more information about these file systems, please refer to the link below:
https://wiki.apache.org/hadoop/AmazonS3

S3AFileSystem was introduced in Hadoop 2.6, and the parallel read fix is
available from Hadoop 2.7 onwards.

If the scheme is s3a and the application runs on Hadoop 2.7+, the user can
set readersCount > 1. With this configuration, the parallel read feature is
enabled.

If the scheme is s3a and the application runs on a Hadoop version below 2.6,
the library throws the following error message:
"Scheme is not supported"

If the scheme is s3 or s3n, there is a single Block Reader instance, so
all the files are read sequentially, which hurts performance.

Parallel read depends entirely on configuration, so there is no need to
call any specific API for the "Parallel Read" feature.
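
The rules above can be summarised in a small Java sketch. This is not the
module's code; the class S3ReadPlanner and the method effectiveReaders are
hypothetical names used only to restate the rules. The behaviour for s3a on
Hadoop 2.6 (a single reader) is my assumption, since the parallel read fix
is only available from 2.7.

```java
// Hypothetical illustration of the rules in this mail, not module code.
class S3ReadPlanner {
    // Returns how many block readers can usefully be used for the given
    // scheme, Hadoop version and configured readersCount.
    static int effectiveReaders(String scheme, int hadoopMajor, int hadoopMinor,
                                int readersCount) {
        boolean atLeast26 = hadoopMajor > 2 || (hadoopMajor == 2 && hadoopMinor >= 6);
        boolean atLeast27 = hadoopMajor > 2 || (hadoopMajor == 2 && hadoopMinor >= 7);
        switch (scheme) {
            case "s3":
            case "s3n":
                // Single Block Reader instance: files are read sequentially.
                return 1;
            case "s3a":
                if (!atLeast26) {
                    // S3AFileSystem does not exist before Hadoop 2.6.
                    throw new IllegalArgumentException("Scheme is not supported");
                }
                // Assumption: on 2.6 (no parallel read fix) fall back to one reader.
                return atLeast27 ? Math.max(1, readersCount) : 1;
            default:
                throw new IllegalArgumentException("Unknown scheme: " + scheme);
        }
    }

    public static void main(String[] args) {
        System.out.println(effectiveReaders("s3a", 2, 7, 4));
        System.out.println(effectiveReaders("s3n", 2, 7, 4));
    }
}
```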

Regards,
Chaitanya

On Mon, Mar 21, 2016 at 11:38 AM, Sandeep Deshmukh <sandeep@datatorrent.com>
wrote:

> I have a little different thought process here. Many people face issues in
> S3 parallel read and if we are able to support parallel read in S3, that
> will add a lot of value in Apex-Malhar capabilities for S3 users.
>
> Although, eventually people will be using Hadoop 2.7+, current production
> users may not move quickly just for this purpose.
>
> Chaitanya: Could you please elaborate on no code change part?
>
> As I understand, there are different protocols that are supported in 2.7+
> and below 2.7. S3A is a new protocol that is supported in 2.7+ that will
> support parallel reads. So, essentially, you will need to configure your
> module differently for 2.7+ and below 2.7. That makes the user specify
> protocol explicitly for different Hadoop versions. Moreover, you will need
> a different configuration in FileSplitter as well that will emit the
> BlockMetaData based on Hadoop version.
>
> What are the general ways of handling such situations in open source
> community? How is such backporting done for dependent libraries?
>
> Regards
>
> Sandeep
> On 19-Mar-2016 9:42 am, "Yogi Devendra" <yogidevendra@apache.org> wrote:
>
> > Chaitanya,
> >
> > This means that those who are below Hadoop 2.7 will still have support
> > for S3 read. Thus, there is no loss of functionality for those users.
> >
> > It is just that, those having Hadoop 2.7+ would have better performance
> > using parallel read.
> >
> > Operator would seamlessly fall back to serial read when parallel read is
> > not possible.
> >
> > CMIIW.
> >
> > ~ Yogi
> >
> > On 19 March 2016 at 08:54, Thomas Weise <thomas@datatorrent.com> wrote:
> >
> > > Chaitanya,
> > >
> > > Thanks, that's good. I see it as a matter of documenting that the
> > > parallel read will only work with Hadoop 2.7+.
> > >
> > > Thomas
> > >
> > > On Fri, Mar 18, 2016 at 10:40 AM, Chaitanya Chebolu <
> > > chaitanya@datatorrent.com> wrote:
> > >
> > > > Hi Thomas,
> > > >
> > > > When using Hadoop 2.7+ version, parallel read functionality is
> > > > automatically available. For this feature, there is no need to
> > > > write any additional code.
> > > >
> > > > Regards,
> > > > Chaitanya
> > > >
> > > > On Fri, Mar 18, 2016 at 9:57 PM, Thomas Weise <thomas@datatorrent.com>
> > > > wrote:
> > > >
> > > > > Does the parallel read functionality automatically become available
> > > > > when using Hadoop 2.7 or later or do you have to write code against
> > > > > different API?
> > > > >
> > > > > Would prefer to not copy things from Hadoop in that case as most
> > > > > users are or will be soon on that version.
> > > > >
> > > > >
> > > > > On Fri, Mar 18, 2016 at 6:07 AM, Chaitanya Chebolu <
> > > > > chaitanya@datatorrent.com> wrote:
> > > > >
> > > > > > Yes Yogi. For Parallel read, we cannot support Hadoop versions
> > > > > > below 2.7 without copying.
> > > > > > I suggested this approach.
> > > > > >
> > > > > >
> > > > > > On Fri, Mar 18, 2016 at 6:26 PM, Yogi Devendra <yogidevendra@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > Chaitanya,
> > > > > > >
> > > > > > > Do you mean to say that we cannot support hadoop versions
> > > > > > > below 2.7 without copying few files from Hadoop 2.7
> > > > > > > implementation?
> > > > > > >
> > > > > > > CMIIW (Correct Me If I'm Wrong).
> > > > > > >
> > > > > > >
> > > > > > > ~ Yogi
> > > > > > >
> > > > > > > On 18 March 2016 at 18:05, Chaitanya Chebolu <chaitanya@datatorrent.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Sandeep,
> > > > > > > >
> > > > > > > >   I am extending from HDFS input module. I am supporting the
> > > > > > > > same features for S3, which are supported by HDFS input module.
> > > > > > > >
> > > > > > > >   Please find my comments in-line.
> > > > > > > >
> > > > > > > > On Fri, Mar 18, 2016 at 12:07 PM, Sandeep Deshmukh <sandeep@datatorrent.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Chaitanya,
> > > > > > > > >
> > > > > > > > > I have a query on parallel reading via S3. Will you be supporting
> > > > > > > > >
> > > > > > > > >    1. Reading one file in parallel (say 4 block readers reading
> > > > > > > > >    the same file)
> > > > > > > > >
> > > > > > > >         Ans: I think this is similar to feature (3).
> > > > > > > >
> > > > > > > > >    2. Reading multiple files in parallel but a file is always read
> > > > > > > > >    serially. So different block reader instances read different files
> > > > > > > > >
> > > > > > > >         Ans: Yes. By configuring "sequentialFileRead" property to
> > > > > > > > true, this feature is enabled.
> > > > > > > >
> > > > > > > > >    3. Mix of 1 and 2. Multiple files are read in parallel, and every
> > > > > > > > >    file in itself is also read in parallel.
> > > > > > > >         Ans: Yes. By default, the module enables this feature.
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > There were issues while reading S3 files in parallel for earlier
> > > > > > > > > versions of Hadoop : 2.2.0 or so and a lot better support in 2.7.
> > > > > > > > > So, will your module work on all Hadoop versions post 2.2 or
> > > > > > > > > only 2.7?
> > > > > > > > >
> > > > > > > >     Ans:  We would like to support parallel read feature for S3
> > > > > > > > independent of Hadoop versions.
> > > > > > > >
> > > > > > > >     One way to support this feature is to copy a few S3-related
> > > > > > > > files from Hadoop 2.7 into the module and use them there.
> > > > > > > >
> > > > > > > >     With this approach, S3 Module supports parallel read
> > > > > > > > independent of the Hadoop version.
> > > > > > > >
> > > > > > > > @All:
> > > > > > > >      Please share your thoughts on this approach.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Chaitanya
> > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Sandeep
> > > > > > > > >
> > > > > > > > > On Fri, Mar 18, 2016 at 10:49 AM, Pradeep Dalvi <pradeep.dalvi@datatorrent.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > +1
> > > > > > > > > >
> > > > > > > > > > On Thu, Mar 17, 2016 at 10:56 PM, Amol Kekre <amol@datatorrent.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > +1. Very common use case. Nice to have it.
> > > > > > > > > > >
> > > > > > > > > > > Thks
> > > > > > > > > > > Amol
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 17, 2016 at 1:49 AM, Sandeep Deshmukh <sandeep@datatorrent.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1
> > > > > > > > > > > >
> > > > > > > > > > > > Many people face issues while copying data from S3 at large
> > > > > > > > > > > > scale. This module is a great contribution that can be readily
> > > > > > > > > > > > used with simple configuration.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Regards,
> > > > > > > > > > > > Sandeep
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Mar 17, 2016 at 2:04 PM, Priyanka Gugale <priyanka@datatorrent.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > It's a good idea to extract out common code in parent class.
> > > > > > > > > > > > >
> > > > > > > > > > > > > +1 for this feature.
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Priyanka
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Mar 17, 2016 at 1:57 PM, Chaitanya Chebolu <chaitanya@datatorrent.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Dear Community,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   I am proposing S3 Input Module. Primary functionality of this
> > > > > > > > > > > > > > module is to parallel read files from S3 bucket.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   Below is the JIRA created for this task:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > https://issues.apache.org/jira/browse/APEXMALHAR-2019
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   Design of this module is similar to HDFS input module. So, I
> > > > > > > > > > > > > > will extend HDFS input module for S3 module.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   Instead of extending HDFS input module, I will create common
> > > > > > > > > > > > > > class for all such file system modules. JIRA for creating common
> > > > > > > > > > > > > > class is here:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > https://issues.apache.org/jira/browse/APEXMALHAR-2018
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >  Please share your thoughts on this.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > Chaitanya
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Pradeep A. Dalvi
> > > > > > > > > >
> > > > > > > > > > Software Engineer
> > > > > > > > > > DataTorrent (India)
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
