apex-dev mailing list archives

From Ashwin Chandra Putta <ashwinchand...@gmail.com>
Subject Re: S3 Input Module
Date Wed, 23 Mar 2016 18:55:20 GMT
Chaitanya,

For hadoop version < 2.6,

1. Is the readersCount value forced to 1, irrespective of the value
configured by the user?
2. Is it possible to allow parallel file reads, i.e. 1 reader per file?

Also, just to confirm: no more copying of s3a files from Hadoop for
previous versions, right?

Regards,
Ashwin.

On Mon, Mar 21, 2016 at 4:45 AM, Chaitanya Chebolu <
chaitanya@datatorrent.com> wrote:

> Hi Sandeep,
>
> For configuring the input module, "files" is the mandatory configuration.
> Description: a comma-separated list of files/directories to copy.
>
> Parallel read depends on the "readersCount" property. This represents the
> number of block reader instances used to read the files. By default, the
> value is 1.
>
> For S3, the user has to specify the "files" property in the form
> SCHEME://AccessKey:SecretKey@BucketName/FileOrDirectory ,
> SCHEME://AccessKey:SecretKey@BucketName/FileOrDirectory , ....
> This URL format is defined by the Hadoop library.
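As an illustration of the URL form above, here is a small sketch (not from the Malhar code base) that assembles "files" property values; the access key, secret key, bucket, and paths are placeholders:

```java
// Sketch: building the "files" property value in the Hadoop-defined
// SCHEME://AccessKey:SecretKey@BucketName/FileOrDirectory form.
// All credentials and names below are placeholders.
public class S3FilesProperty {

    static String s3FileUrl(String scheme, String accessKey, String secretKey,
                            String bucket, String path) {
        return scheme + "://" + accessKey + ":" + secretKey + "@" + bucket + "/" + path;
    }

    public static void main(String[] args) {
        // The "files" property is a comma-separated list of such URLs.
        String files = s3FileUrl("s3n", "ACCESS", "SECRET", "my-bucket", "input/dir")
                + ","
                + s3FileUrl("s3n", "ACCESS", "SECRET", "my-bucket", "archive/file.txt");
        System.out.println(files);
    }
}
```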
>
>   The Hadoop library supports the following file systems for S3, with
> their schemes shown in parentheses:
> 1) S3FileSystem (s3)
> 2) NativeS3FileSystem (s3n)
> 3) S3AFileSystem (s3a)
>
>   For more info about these file systems, please refer to the link below:
> https://wiki.apache.org/hadoop/AmazonS3
>
> S3AFileSystem was introduced in Hadoop 2.6, and the parallel read fix has
> been available since Hadoop 2.7.
>
> If the scheme is s3a and the application is running on Hadoop 2.7+, then
> the user can set readersCount > 1. With these configurations, the parallel
> read feature is enabled.
>
> If the scheme is s3a and the application is running on a Hadoop version
> below 2.6, then the library throws the following error message:
> "Scheme is not supported"
>
> If the scheme is s3/s3n, then there is a single instance of the block
> reader, so all the files are read sequentially. This impacts performance.
>
> Parallel read depends entirely on configuration, so there is no need to
> call any specific API for the parallel read feature.
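The scheme/version rule described above can be sketched as a small decision function. This is illustrative only, not Malhar's actual implementation; the method name and signature are assumptions:

```java
// Sketch of the rule discussed above: parallel read requires the s3a
// scheme on Hadoop 2.7+; otherwise a single block reader is used.
public class ParallelReadRule {

    static int effectiveReadersCount(String scheme, int major, int minor,
                                     int configuredReadersCount) {
        boolean parallelCapable = "s3a".equals(scheme)
                && (major > 2 || (major == 2 && minor >= 7));
        // Fall back to one sequential block reader when parallel read
        // is not supported for this scheme/version combination.
        return parallelCapable ? configuredReadersCount : 1;
    }

    public static void main(String[] args) {
        System.out.println(effectiveReadersCount("s3a", 2, 7, 4)); // prints 4
        System.out.println(effectiveReadersCount("s3n", 2, 7, 4)); // prints 1
    }
}
```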
>
> Regards,
> Chaitanya
>
> On Mon, Mar 21, 2016 at 11:38 AM, Sandeep Deshmukh <
> sandeep@datatorrent.com>
> wrote:
>
> > I have a slightly different thought process here. Many people face issues
> > with S3 parallel read, and if we are able to support parallel read in S3,
> > that will add a lot of value to Apex-Malhar's capabilities for S3 users.
> >
> > Although people will eventually be using Hadoop 2.7+, current production
> > users may not move quickly just for this purpose.
> >
> > Chaitanya: Could you please elaborate on no code change part?
> >
> > As I understand, there are different protocols supported in 2.7+ and
> > below 2.7. S3A is a new protocol, supported in 2.7+, that will support
> > parallel reads. So, essentially, you will need to configure your module
> > differently for 2.7+ and below 2.7. That makes the user specify the
> > protocol explicitly for different Hadoop versions. Moreover, you will
> > need a different configuration in FileSplitter as well, to emit the
> > BlockMetaData based on the Hadoop version.
> >
> > What are the general ways of handling such situations in the open source
> > community? How is such backporting done for dependent libraries?
> >
> > Regards
> >
> > Sandeep
> > On 19-Mar-2016 9:42 am, "Yogi Devendra" <yogidevendra@apache.org> wrote:
> >
> > > Chaitanya,
> > >
> > > This means that those who are below Hadoop 2.7 will still have support
> > > for S3 read. Thus, there is no loss of functionality for those users.
> > >
> > > It is just that, those having Hadoop 2.7+ would have better performance
> > > using parallel read.
> > >
> > > The operator would seamlessly fall back to serial read when parallel
> > > read is not possible.
> > >
> > > CMIIW.
> > >
> > > ~ Yogi
> > >
> > > On 19 March 2016 at 08:54, Thomas Weise <thomas@datatorrent.com> wrote:
> > >
> > > > Chaitanya,
> > > >
> > > > Thanks, that's good. I see it as a matter of documenting that the
> > > > parallel read will only work with Hadoop 2.7+.
> > > >
> > > > Thomas
> > > >
> > > > On Fri, Mar 18, 2016 at 10:40 AM, Chaitanya Chebolu <
> > > > chaitanya@datatorrent.com> wrote:
> > > >
> > > > > Hi Thomas,
> > > > >
> > > > > When using Hadoop 2.7+, the parallel read functionality is
> > > > > automatically available. No additional code needs to be written for
> > > > > this feature.
> > > > >
> > > > > Regards,
> > > > > Chaitanya
> > > > >
> > > > > On Fri, Mar 18, 2016 at 9:57 PM, Thomas Weise <
> > > > > thomas@datatorrent.com> wrote:
> > > > >
> > > > > > Does the parallel read functionality automatically become available
> > > > > > when using Hadoop 2.7 or later, or do you have to write code against
> > > > > > a different API?
> > > > > >
> > > > > > Would prefer to not copy things from Hadoop in that case, as most
> > > > > > users are or will be soon on that version.
> > > > > >
> > > > > >
> > > > > > On Fri, Mar 18, 2016 at 6:07 AM, Chaitanya Chebolu <
> > > > > > chaitanya@datatorrent.com> wrote:
> > > > > >
> > > > > > > Yes Yogi. For parallel read, we cannot support Hadoop versions
> > > > > > > below 2.7 without copying.
> > > > > > > I suggested this approach.
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Mar 18, 2016 at 6:26 PM, Yogi Devendra <
> > > > > > > yogidevendra@apache.org> wrote:
> > > > > > >
> > > > > > > > Chaitanya,
> > > > > > > >
> > > > > > > > Do you mean to say that we cannot support Hadoop versions below
> > > > > > > > 2.7 without copying a few files from the Hadoop 2.7
> > > > > > > > implementation?
> > > > > > > >
> > > > > > > > CMIIW (Correct Me If I'm Wrong).
> > > > > > > >
> > > > > > > >
> > > > > > > > ~ Yogi
> > > > > > > >
> > > > > > > > On 18 March 2016 at 18:05, Chaitanya Chebolu <
> > > > > > > > chaitanya@datatorrent.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Sandeep,
> > > > > > > > >
> > > > > > > > >   I am extending from the HDFS input module. I am supporting the
> > > > > > > > > same features for S3 that are supported by the HDFS input module.
> > > > > > > > >
> > > > > > > > >   Please find my comments in-line.
> > > > > > > > >
> > > > > > > > > On Fri, Mar 18, 2016 at 12:07 PM, Sandeep Deshmukh <
> > > > > > > > > sandeep@datatorrent.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Chaitanya,
> > > > > > > > > >
> > > > > > > > > > I have a query on parallel reading via S3. Will you be
> > > > > > > > > > supporting
> > > > > > > > > >
> > > > > > > > > >    1. Reading one file in parallel (say 4 block readers reading
> > > > > > > > > >    the same file)
> > > > > > > > > >
> > > > > > > > >         Ans: I think this is similar to feature (3).
> > > > > > > > >
> > > > > > > > > >    2. Reading multiple files in parallel, but a file is always
> > > > > > > > > >    read serially. So different block reader instances read
> > > > > > > > > >    different files
> > > > > > > > >
> > > > > > > > >         Ans: Yes. This feature is enabled by configuring the
> > > > > > > > > "sequentialFileRead" property to true.
> > > > > > > > >
> > > > > > > > > >    3. Mix of 1 and 2. Multiple files are read in parallel, and
> > > > > > > > > >    every file in itself is also read in parallel.
> > > > > > > > >
> > > > > > > > >       Ans: Yes. By default, the module enables this feature.
> > > > > > > > >
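The three read modes discussed above map onto the "sequentialFileRead" and "readersCount" properties from this thread. A properties-file sketch, where the operator name S3Input and the dt.operator prefix are illustrative assumptions rather than something stated in the thread:

```xml
<configuration>
  <!-- readersCount: number of block reader instances (default 1). -->
  <property>
    <name>dt.operator.S3Input.prop.readersCount</name>
    <value>4</value>
  </property>
  <!-- sequentialFileRead: set to true for mode 2 (each file read serially,
       different readers take different files); leave false for modes 1/3,
       where blocks of one file are read in parallel. -->
  <property>
    <name>dt.operator.S3Input.prop.sequentialFileRead</name>
    <value>false</value>
  </property>
</configuration>
```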
> > > > > > > > > > There were issues while reading S3 files in parallel for
> > > > > > > > > > earlier versions of Hadoop (2.2.0 or so), and there is much
> > > > > > > > > > better support in 2.7. So, will your module work on all Hadoop
> > > > > > > > > > versions post 2.2, or only on 2.7?
> > > > > > > > > >
> > > > > > > > >     Ans: We would like to support the parallel read feature for
> > > > > > > > > S3 independently of the Hadoop version.
> > > > > > > > >
> > > > > > > > >     One way to support this feature is to copy a few S3-related
> > > > > > > > > files from the Hadoop 2.7 version into the module and use them
> > > > > > > > > there.
> > > > > > > > >
> > > > > > > > >     With this approach, the S3 module supports parallel read
> > > > > > > > > independently of the Hadoop version.
> > > > > > > > >
> > > > > > > > > @All:
> > > > > > > > >      Please share your thoughts on this approach.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Chaitanya
> > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Sandeep
> > > > > > > > > >
> > > > > > > > > > On Fri, Mar 18, 2016 at 10:49 AM, Pradeep Dalvi <
> > > > > > > > > > pradeep.dalvi@datatorrent.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > +1
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 17, 2016 at 10:56 PM, Amol Kekre <
> > > > > > > > > > > amol@datatorrent.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1. Very common use case. Nice to have it.
> > > > > > > > > > > >
> > > > > > > > > > > > Thks
> > > > > > > > > > > > Amol
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Mar 17, 2016 at 1:49 AM, Sandeep Deshmukh <
> > > > > > > > > > > > sandeep@datatorrent.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > +1
> > > > > > > > > > > > >
> > > > > > > > > > > > > Many people face issues while copying data from S3 at
> > > > > > > > > > > > > large scale. This module is a great contribution that can
> > > > > > > > > > > > > be readily used with simple configuration.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > Sandeep
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Mar 17, 2016 at 2:04 PM, Priyanka Gugale <
> > > > > > > > > > > > > priyanka@datatorrent.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > It's a good idea to extract out the common code into a
> > > > > > > > > > > > > > parent class.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +1 for this feature.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -Priyanka
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Mar 17, 2016 at 1:57 PM, Chaitanya Chebolu <
> > > > > > > > > > > > > > chaitanya@datatorrent.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Dear Community,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   I am proposing an S3 Input Module. The primary
> > > > > > > > > > > > > > > functionality of this module is to read files from an
> > > > > > > > > > > > > > > S3 bucket in parallel.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   Below is the JIRA created for this task:
> > > > > > > > > > > > > > > https://issues.apache.org/jira/browse/APEXMALHAR-2019
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   The design of this module is similar to the HDFS
> > > > > > > > > > > > > > > input module, so I will extend the HDFS input module
> > > > > > > > > > > > > > > for the S3 module.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   Instead of extending the HDFS input module, I will
> > > > > > > > > > > > > > > create a common class for all such file system
> > > > > > > > > > > > > > > modules. The JIRA for creating the common class is
> > > > > > > > > > > > > > > here:
> > > > > > > > > > > > > > > https://issues.apache.org/jira/browse/APEXMALHAR-2018
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >  Please share your thoughts on this.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > Chaitanya
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Pradeep A. Dalvi
> > > > > > > > > > >
> > > > > > > > > > > Software Engineer
> > > > > > > > > > > DataTorrent (India)
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>



-- 

Regards,
Ashwin.
