apex-dev mailing list archives

From Chaitanya Chebolu <chaita...@datatorrent.com>
Subject Re: S3 Input Module
Date Fri, 18 Mar 2016 17:40:29 GMT
Hi Thomas,

When using Hadoop 2.7 or later, the parallel read functionality is
automatically available. No additional code is needed for this feature.
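
For illustration only, here is a minimal sketch of what "parallel read" of a
single file means: the file is split into fixed-size block ranges and each
range is fetched by a separate reader using positional reads. This is not code
from the S3 module or from Hadoop. It uses only the JDK against a local file,
and the class and method names (ParallelBlockReadSketch, blockRanges,
readBlock, readInParallel) are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; a local file stands in for an S3 object.
final class ParallelBlockReadSketch {

    /** Splits [0, fileLength) into consecutive {offset, length} ranges of at most blockSize bytes. */
    static List<long[]> blockRanges(long fileLength, long blockSize) {
        List<long[]> ranges = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            ranges.add(new long[] {offset, Math.min(blockSize, fileLength - offset)});
        }
        return ranges;
    }

    /** Reads one range with positional reads, which do not move the channel's shared position. */
    static byte[] readBlock(FileChannel channel, long offset, int length) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        while (buffer.hasRemaining()) {
            int n = channel.read(buffer, offset + buffer.position());
            if (n < 0) {
                throw new IOException("unexpected end of file");
            }
        }
        return buffer.array();
    }

    /** Reads every block on a pool of reader threads and reassembles the blocks in order. */
    static byte[] readInParallel(Path file, long blockSize, int readers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(readers);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            List<long[]> ranges = blockRanges(size, blockSize);
            List<Future<byte[]>> futures = new ArrayList<>();
            for (long[] r : ranges) {
                futures.add(pool.submit(() -> readBlock(channel, r[0], (int) r[1])));
            }
            byte[] out = new byte[(int) size];
            for (int i = 0; i < ranges.size(); i++) {
                byte[] block = futures.get(i).get();
                System.arraycopy(block, 0, out, (int) ranges.get(i)[0], block.length);
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }
}
```

Each reader only needs an offset and a length, so the blocks of one file can be
read independently and in any order, provided the underlying store supports
reading from an arbitrary position.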

Regards,
Chaitanya

On Fri, Mar 18, 2016 at 9:57 PM, Thomas Weise <thomas@datatorrent.com>
wrote:

> Does the parallel read functionality automatically become available when
> using Hadoop 2.7 or later, or do you have to write code against a different
> API?
>
> Would prefer to not copy things from Hadoop in that case as most users are
> or will be soon on that version.
>
>
> On Fri, Mar 18, 2016 at 6:07 AM, Chaitanya Chebolu <
> chaitanya@datatorrent.com> wrote:
>
> > Yes, Yogi. For parallel read, we cannot support Hadoop versions below 2.7
> > without copying.
> > That is the approach I suggested.
> >
> >
> > On Fri, Mar 18, 2016 at 6:26 PM, Yogi Devendra <yogidevendra@apache.org>
> > wrote:
> >
> > > Chaitanya,
> > >
> > > > Do you mean to say that we cannot support Hadoop versions below 2.7
> > > > without copying a few files from the Hadoop 2.7 implementation?
> > >
> > > CMIIW (Correct Me If I'm Wrong).
> > >
> > >
> > > ~ Yogi
> > >
> > > > On 18 March 2016 at 18:05, Chaitanya Chebolu <
> > > > chaitanya@datatorrent.com> wrote:
> > >
> > > > Hi Sandeep,
> > > >
> > > > >   I am extending the HDFS input module, so the S3 module supports the
> > > > > same features that the HDFS input module supports.
> > > >
> > > >   Please find my comments in-line.
> > > >
> > > > On Fri, Mar 18, 2016 at 12:07 PM, Sandeep Deshmukh <
> > > > sandeep@datatorrent.com>
> > > > wrote:
> > > >
> > > > > Hi Chaitanya,
> > > > >
> > > > > I have a query on parallel reading via S3. Will you be supporting
> > > > >
> > > > > >    1. Reading one file in parallel (say, 4 block readers reading the
> > > > > >    same file)
> > > > >
> > > >         Ans: I think this is similar to feature (3).
> > > >
> > > > >    2. Reading multiple files in parallel but a file is always read
> > > > >    serially. So different block reader instances read different
> files
> > > > >
> > > > >         Ans: Yes. Setting the "sequentialFileRead" property to true
> > > > > enables this feature.
> > > >
> > > > > >    3. Mix of 1 and 2. Multiple files are read in parallel, and every
> > > > > >    file in itself is also read in parallel.
> > > > >         Ans: Yes. By default, the module enables this feature.
> > > > >
> > > >
> > > >
> > > > > > There were issues while reading S3 files in parallel with earlier
> > > > > > versions of Hadoop (2.2.0 or so), and support is a lot better in 2.7.
> > > > > > So, will your module work on all Hadoop versions after 2.2 or only 2.7?
> > > > > >
> > > > >     Ans: We would like to support the parallel read feature for S3
> > > > > independent of the Hadoop version.
> > > >
> > > > >     One way to support this feature is to copy a few S3-related files
> > > > > from Hadoop 2.7 into the module and use them there.
> > > > >
> > > > >     With this approach, the S3 module supports parallel read
> > > > > independent of the Hadoop version.
> > > >
> > > > @All:
> > > >      Please share your thoughts on this approach.
> > > >
> > > > Regards,
> > > > Chaitanya
> > > >
> > > > > > Regards,
> > > > > Sandeep
> > > > >
> > > > > On Fri, Mar 18, 2016 at 10:49 AM, Pradeep Dalvi <
> > > > > pradeep.dalvi@datatorrent.com> wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > > On Thu, Mar 17, 2016 at 10:56 PM, Amol Kekre <amol@datatorrent.com>
> > > > > > > wrote:
> > > > > >
> > > > > > > +1. Very common use case. Nice to have it.
> > > > > > >
> > > > > > > Thks
> > > > > > > Amol
> > > > > > >
> > > > > > >
> > > > > > > > On Thu, Mar 17, 2016 at 1:49 AM, Sandeep Deshmukh <
> > > > > > > > sandeep@datatorrent.com> wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > > Many people face issues while copying data from S3 at large scale.
> > > > > > > > > This module is a great contribution that can be readily used with
> > > > > > > > > simple configuration.
> > > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Sandeep
> > > > > > > >
> > > > > > > > > On Thu, Mar 17, 2016 at 2:04 PM, Priyanka Gugale <
> > > > > > > > > priyanka@datatorrent.com> wrote:
> > > > > > > >
> > > > > > > > > > It's a good idea to extract the common code into a parent class.
> > > > > > > > >
> > > > > > > > > +1 for this feature.
> > > > > > > > >
> > > > > > > > > -Priyanka
> > > > > > > > >
> > > > > > > > > > On Thu, Mar 17, 2016 at 1:57 PM, Chaitanya Chebolu <
> > > > > > > > > chaitanya@datatorrent.com> wrote:
> > > > > > > > >
> > > > > > > > > > Dear Community,
> > > > > > > > > >
> > > > > > > > > > >   I am proposing an S3 Input Module. The primary functionality
> > > > > > > > > > > of this module is to read files from an S3 bucket in parallel.
> > > > > > > > > >
> > > > > > > > > >   Below is the JIRA created for this task:
> > > > > > > > > > https://issues.apache.org/jira/browse/APEXMALHAR-2019
> > > > > > > > > >
> > > > > > > > > > >   The design of this module is similar to the HDFS input
> > > > > > > > > > > module, so I will extend the HDFS input module for the S3 module.
> > > > > > > > > >
> > > > > > > > > > >   Instead of extending the HDFS input module, I will create a
> > > > > > > > > > > common class for all such file system modules. The JIRA for
> > > > > > > > > > > creating the common class is here:
> > > > > > > > > > https://issues.apache.org/jira/browse/APEXMALHAR-2018
> > > > > > > > > >
> > > > > > > > > >  Please share your thoughts on this.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Chaitanya
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Pradeep A. Dalvi
> > > > > >
> > > > > > Software Engineer
> > > > > > DataTorrent (India)
> > > > > >
> > > > >
> > > >
> > >
> >
>
