apex-dev mailing list archives

From Chaitanya Chebolu <chaita...@datatorrent.com>
Subject Re: S3 Input Module
Date Thu, 24 Mar 2016 08:18:10 GMT
@Ashwin:

   Please find my comments below:
1) No.
2) Yes. For Hadoop versions < 2.6, all the blocks of a file are read by the
same reader instance.

If we don't copy the s3a files, we lose the parallel reader functionality
for Hadoop versions < 2.6.

Regards,
Chaitanya

On Thu, Mar 24, 2016 at 12:25 AM, Ashwin Chandra Putta <
ashwinchandrap@gmail.com> wrote:

> Chaitanya,
>
> For hadoop version < 2.6,
>
> 1. Is the readersCount value forced to 1 irrespective of the value
> configured by the user?
> 2. Is it possible to allow parallel file reads, i.e., one reader per file?
>
> Also, just to confirm: no more copying s3a files from Hadoop for previous
> versions, right?
>
> Regards,
> Ashwin.
>
> On Mon, Mar 21, 2016 at 4:45 AM, Chaitanya Chebolu <
> chaitanya@datatorrent.com> wrote:
>
> > Hi Sandeep,
> >
> > For configuring the input module, "files" is the mandatory configuration.
> > Description: comma-separated list of files/directories to copy.
> >
> > Parallel read depends on the "readersCount" property, which represents
> > the number of block reader instances used to read the files. By default,
> > the value is 1.
> >
> > For S3, the user has to specify the "files" property in the form
> > SCHEME://AccessKey:SecretKey@BucketName/FileOrDirectory,
> > SCHEME://AccessKey:SecretKey@BucketName/FileOrDirectory, ....
> > This URL format is defined by the Hadoop library.
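> >
> > As an illustration, here is a minimal sketch of wiring such a module into
> > an Apex application, assuming the module exposes setters for the "files"
> > and "readersCount" properties described above (the S3InputModule class
> > name, its setters, and the "S3Input" module name are placeholders for the
> > proposed module):
> >
> > import org.apache.hadoop.conf.Configuration;
> > import com.datatorrent.api.DAG;
> > import com.datatorrent.api.StreamingApplication;
> >
> > public class S3ReadApp implements StreamingApplication
> > {
> >   @Override
> >   public void populateDAG(DAG dag, Configuration conf)
> >   {
> >     // Hypothetical module class for the proposed S3 input module.
> >     S3InputModule input = dag.addModule("S3Input", new S3InputModule());
> >     // Comma-separated list of files/directories, using the s3a scheme.
> >     input.setFiles("s3a://AccessKey:SecretKey@my-bucket/input-dir");
> >     // Number of block reader instances; > 1 enables parallel read on
> >     // Hadoop 2.7+ with the s3a scheme.
> >     input.setReadersCount(4);
> >     // ... connect the module's output ports to downstream operators ...
> >   }
> > }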
> >
> > The Hadoop library supports the following file systems for S3; the
> > schemes are shown in parentheses:
> > 1) S3 (s3)
> > 2) NativeS3FileSystem (s3n)
> > 3) S3AFileSystem (s3a)
> >
> > For more information about these file systems, please refer to the link
> > below:
> > https://wiki.apache.org/hadoop/AmazonS3
> >
> > S3AFileSystem was introduced in Hadoop 2.6, and the parallel read fix has
> > been available since Hadoop 2.7.
> >
> > If the scheme is s3a and the application is running on Hadoop 2.7+, then
> > the user can specify readersCount > 1. With this configuration, the
> > parallel read feature is enabled.
> >
> > If the scheme is s3a and the application is running on a Hadoop version
> > older than 2.6, then the library throws the following error message:
> > "Scheme is not supported"
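> >
> > As a side note, here is a small sketch of how an application could check
> > up front whether the running Hadoop distribution registers the s3a scheme.
> > This check is not part of the proposed module; it assumes
> > FileSystem.getFileSystemClass, which should be available in Hadoop 2.x:
> >
> > import java.io.IOException;
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> >
> > public class S3aSchemeCheck
> > {
> >   // Returns true if the running Hadoop registers a FileSystem for "s3a".
> >   public static boolean isS3aAvailable(Configuration conf)
> >   {
> >     try {
> >       FileSystem.getFileSystemClass("s3a", conf);
> >       return true;
> >     } catch (IOException e) {
> >       // On Hadoop < 2.6 this typically fails because no FileSystem is
> >       // registered for the s3a scheme.
> >       return false;
> >     }
> >   }
> > }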
> >
> > If the scheme is s3/s3n, then there is a single block reader instance, so
> > all the files are read sequentially. This impacts performance.
> >
> > Parallel read depends completely on configuration, so I don't need to
> > call any specific API for the "Parallel Read" feature.
> >
> > Regards,
> > Chaitanya
> >
> > On Mon, Mar 21, 2016 at 11:38 AM, Sandeep Deshmukh <
> > sandeep@datatorrent.com>
> > wrote:
> >
> > > I have a slightly different thought process here. Many people face
> > > issues with S3 parallel read, and if we are able to support parallel
> > > read in S3, that will add a lot of value to Apex-Malhar's capabilities
> > > for S3 users.
> > >
> > > Although people will eventually be using Hadoop 2.7+, current
> > > production users may not move quickly just for this purpose.
> > >
> > > Chaitanya: Could you please elaborate on the no-code-change part?
> > >
> > > As I understand it, different protocols are supported in 2.7+ and below
> > > 2.7. S3A is a new protocol, supported in 2.7+, that will support
> > > parallel reads. So, essentially, you will need to configure your module
> > > differently for 2.7+ and below 2.7. That makes the user specify the
> > > protocol explicitly for different Hadoop versions. Moreover, you will
> > > also need a different configuration in FileSplitter that emits the
> > > BlockMetaData based on the Hadoop version.
> > >
> > > What are the general ways of handling such situations in the open
> > > source community? How is such backporting done for dependent libraries?
> > >
> > > Regards
> > >
> > > Sandeep
> > > On 19-Mar-2016 9:42 am, "Yogi Devendra" <yogidevendra@apache.org>
> wrote:
> > >
> > > > Chaitanya,
> > > >
> > > > This means that those who are below Hadoop 2.7 will still have
> > > > support for S3 read. Thus, there is no loss of functionality for
> > > > those users.
> > > >
> > > > It is just that those having Hadoop 2.7+ would have better
> > > > performance using parallel read.
> > > >
> > > > The operator would seamlessly fall back to serial read when parallel
> > > > read is not possible.
> > > >
> > > > CMIIW.
> > > >
> > > > ~ Yogi
> > > >
> > > > On 19 March 2016 at 08:54, Thomas Weise <thomas@datatorrent.com>
> > wrote:
> > > >
> > > > > Chaitanya,
> > > > >
> > > > > Thanks, that's good. I see it as a matter of documenting that the
> > > > > parallel read will only work with Hadoop 2.7+.
> > > > >
> > > > > Thomas
> > > > >
> > > > > On Fri, Mar 18, 2016 at 10:40 AM, Chaitanya Chebolu <
> > > > > chaitanya@datatorrent.com> wrote:
> > > > >
> > > > > > Hi Thomas,
> > > > > >
> > > > > > When using Hadoop 2.7+, the parallel read functionality is
> > > > > > automatically available. For this feature, there is no need to
> > > > > > write any additional code.
> > > > > >
> > > > > > Regards,
> > > > > > Chaitanya
> > > > > >
> > > > > > On Fri, Mar 18, 2016 at 9:57 PM, Thomas Weise <
> > > thomas@datatorrent.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Does the parallel read functionality automatically become
> > > > > > > available when using Hadoop 2.7 or later, or do you have to
> > > > > > > write code against a different API?
> > > > > > >
> > > > > > > I would prefer not to copy things from Hadoop in that case, as
> > > > > > > most users are or will soon be on that version.
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Mar 18, 2016 at 6:07 AM, Chaitanya Chebolu <
> > > > > > > chaitanya@datatorrent.com> wrote:
> > > > > > >
> > > > > > > > Yes, Yogi. For parallel read, we cannot support Hadoop
> > > > > > > > versions below 2.7 without copying.
> > > > > > > > That is why I suggested this approach.
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Mar 18, 2016 at 6:26 PM, Yogi Devendra <
> > > > > > yogidevendra@apache.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Chaitanya,
> > > > > > > > >
> > > > > > > > > Do you mean to say that we cannot support Hadoop versions
> > > > > > > > > below 2.7 without copying a few files from the Hadoop 2.7
> > > > > > > > > implementation?
> > > > > > > > >
> > > > > > > > > CMIIW (Correct Me If I'm Wrong).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > ~ Yogi
> > > > > > > > >
> > > > > > > > > On 18 March 2016 at 18:05, Chaitanya Chebolu <
> > > > > > > > > chaitanya@datatorrent.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Sandeep,
> > > > > > > > > >
> > > > > > > > > >   I am extending the HDFS input module. I am supporting
> > > > > > > > > > the same features for S3 that are supported by the HDFS
> > > > > > > > > > input module.
> > > > > > > > > >
> > > > > > > > > >   Please find my comments in-line.
> > > > > > > > > >
> > > > > > > > > > On Fri, Mar 18, 2016 at 12:07 PM, Sandeep Deshmukh <
> > > > > > > > > > sandeep@datatorrent.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Chaitanya,
> > > > > > > > > > >
> > > > > > > > > > > I have a query on parallel reading via S3. Will you be
> > > > > > > > > > > supporting:
> > > > > > > > > > >
> > > > > > > > > > >    1. Reading one file in parallel (say 4 block readers
> > > > > > > > > > >    reading the same file)
> > > > > > > > > > >
> > > > > > > > > >         Ans: I think this is similar to feature (3).
> > > > > > > > > >
> > > > > > > > > > >    2. Reading multiple files in parallel, but a file is
> > > > > > > > > > >    always read serially. So different block reader
> > > > > > > > > > >    instances read different files.
> > > > > > > > > > >
> > > > > > > > > >         Ans: Yes. This feature is enabled by configuring
> > > > > > > > > > the "sequentialFileRead" property to true (see the short
> > > > > > > > > > sketch after this list).
> > > > > > > > > >
> > > > > > > > > > >    3. Mix of 1 and 2. Multiple files are read in
> > > > > > > > > > >    parallel, and every file in itself is also read in
> > > > > > > > > > >    parallel.
> > > > > > > > > >       Ans: Yes. By default, the module enables this
> > > > > > > > > > feature.
> > > > > > > > > > >
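> > > > > > > > > >   A short sketch of how these two modes might be chosen,
> > > > > > > > > > assuming a hypothetical S3InputModule whose setters are
> > > > > > > > > > named after the properties above (sketch only, inside an
> > > > > > > > > > application's populateDAG):
> > > > > > > > > >
> > > > > > > > > > S3InputModule input = dag.addModule("S3Input", new S3InputModule());
> > > > > > > > > > // Feature (2): read files in parallel, each file serially.
> > > > > > > > > > input.setSequentialFileRead(true);
> > > > > > > > > > // Feature (3), the default: also split each file across
> > > > > > > > > > // several block readers, e.g.:
> > > > > > > > > > // input.setSequentialFileRead(false);
> > > > > > > > > > // input.setReadersCount(4);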
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > There were issues while reading S3 files in parallel
> > > > > > > > > > > for earlier versions of Hadoop (2.2.0 or so), and there
> > > > > > > > > > > is much better support in 2.7. So, will your module work
> > > > > > > > > > > on all Hadoop versions post 2.2 or only 2.7?
> > > > > > > > > > >
> > > > > > > > > >     Ans: We would like to support the parallel read
> > > > > > > > > > feature for S3 independent of the Hadoop version.
> > > > > > > > > >
> > > > > > > > > >     One way to support this feature is to copy a few
> > > > > > > > > > S3-related files from Hadoop 2.7 into the module and use
> > > > > > > > > > them there.
> > > > > > > > > >
> > > > > > > > > >     With this approach, the S3 module supports parallel
> > > > > > > > > > read independent of the Hadoop version.
> > > > > > > > > >
> > > > > > > > > > @All:
> > > > > > > > > >      Please share your thoughts on this
approach.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Chaitanya
> > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Sandeep
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Mar 18, 2016 at 10:49 AM, Pradeep Dalvi <
> > > > > > > > > > > pradeep.dalvi@datatorrent.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Mar 17, 2016 at 10:56 PM, Amol Kekre <
> > > > > > > > > > > > amol@datatorrent.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > +1. Very common use case. Nice to have it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thks
> > > > > > > > > > > > > Amol
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Mar 17, 2016 at 1:49 AM, Sandeep Deshmukh <
> > > > > > > > > > > > > sandeep@datatorrent.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > +1
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Many people face issues while copying data from
> > > > > > > > > > > > > > S3 at large scale. This module is a great
> > > > > > > > > > > > > > contribution that can be readily used with simple
> > > > > > > > > > > > > > configuration.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > Sandeep
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Mar 17, 2016 at 2:04 PM, Priyanka Gugale <
> > > > > > > > > > > > > > priyanka@datatorrent.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It's a good idea to extract out the common code
> > > > > > > > > > > > > > > into a parent class.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +1 for this feature.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -Priyanka
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, Mar 17, 2016 at 1:57 PM, Chaitanya Chebolu <
> > > > > > > > > > > > > > > chaitanya@datatorrent.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Dear Community,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   I am proposing an S3 Input Module. The
> > > > > > > > > > > > > > > > primary functionality of this module is to
> > > > > > > > > > > > > > > > read files from an S3 bucket in parallel.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   Below is the JIRA created for this task:
> > > > > > > > > > > > > > > > https://issues.apache.org/jira/browse/APEXMALHAR-2019
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   The design of this module is similar to the
> > > > > > > > > > > > > > > > HDFS input module. So, I will extend the HDFS
> > > > > > > > > > > > > > > > input module for the S3 module.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   Instead of extending the HDFS input module,
> > > > > > > > > > > > > > > > I will create a common class for all such file
> > > > > > > > > > > > > > > > system modules. The JIRA for creating the
> > > > > > > > > > > > > > > > common class is here:
> > > > > > > > > > > > > > > > https://issues.apache.org/jira/browse/APEXMALHAR-2018
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   Please share your thoughts on this.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > > Chaitanya
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Pradeep A. Dalvi
> > > > > > > > > > > >
> > > > > > > > > > > > Software Engineer
> > > > > > > > > > > > DataTorrent (India)
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
>
> --
>
> Regards,
> Ashwin.
>
