apex-dev mailing list archives

From Thomas Weise <tho...@datatorrent.com>
Subject Re: S3 Input Module
Date Fri, 18 Mar 2016 16:27:24 GMT
Does the parallel read functionality automatically become available when
using Hadoop 2.7 or later, or do you have to write code against a different
API?

I would prefer not to copy things from Hadoop in that case, as most users are
or soon will be on that version.


On Fri, Mar 18, 2016 at 6:07 AM, Chaitanya Chebolu <
chaitanya@datatorrent.com> wrote:

> Yes Yogi. For Parallel read, we cannot support Hadoop versions below 2.7
> without copying.
> I suggested this approach.
>
>
> On Fri, Mar 18, 2016 at 6:26 PM, Yogi Devendra <yogidevendra@apache.org>
> wrote:
>
> > Chaitanya,
> >
> > Do you mean to say that we cannot support Hadoop versions below 2.7
> > without copying a few files from the Hadoop 2.7 implementation?
> >
> > CMIIW (Correct Me If I'm Wrong).
> >
> >
> > ~ Yogi
> >
> > On 18 March 2016 at 18:05, Chaitanya Chebolu <chaitanya@datatorrent.com>
> > wrote:
> >
> > > Hi Sandeep,
> > >
> > >   I am extending the HDFS input module, and I am supporting the same
> > > features for S3 that the HDFS input module supports.
> > >
> > >   Please find my comments in-line.
> > >
> > > On Fri, Mar 18, 2016 at 12:07 PM, Sandeep Deshmukh <
> > > sandeep@datatorrent.com>
> > > wrote:
> > >
> > > > Hi Chaitanya,
> > > >
> > > > I have a query on parallel reading via S3. Will you be supporting
> > > >
> > > >    1. Reading one file in parallel (say, 4 block readers reading the
> > > >    same file)
> > > >
> > >         Ans: I think this is similar to feature (3).
> > >
> > > >    2. Reading multiple files in parallel but a file is always read
> > > >    serially. So different block reader instances read different files
> > > >
> > >         Ans: Yes. Setting the "sequentialFileRead" property to true
> > >         enables this feature.
> > >
> > > >    3. Mix of 1 and 2. Multiple files are read in parallel, and every
> > > >    file in itself is also read in parallel.
> > >         Ans: Yes. The module enables this feature by default.
> > > >
> > >
> > >
> > > > There were issues with reading S3 files in parallel in earlier
> > > > versions of Hadoop (2.2.0 or so), and support is a lot better in 2.7.
> > > > So, will your module work on all Hadoop versions after 2.2, or only
> > > > on 2.7?
> > > >
> > >     Ans: We would like to support the parallel read feature for S3
> > > independent of the Hadoop version.
> > >
> > >     One way to support this feature is to copy a few S3-related files
> > > from Hadoop 2.7 into the module and use them there.
> > >
> > >     With this approach, the S3 module supports parallel read independent
> > > of the Hadoop version.
> > >
> > > @All:
> > >      Please share your thoughts on this approach.
> > >
> > > Regards,
> > > Chaitanya
> > >
> > > > Regards,
> > > > Sandeep
> > > >
> > > > On Fri, Mar 18, 2016 at 10:49 AM, Pradeep Dalvi <
> > > > pradeep.dalvi@datatorrent.com> wrote:
> > > >
> > > > > +1
> > > > >
> > > > > On Thu, Mar 17, 2016 at 10:56 PM, Amol Kekre <amol@datatorrent.com>
> > > > > wrote:
> > > > >
> > > > > > +1. Very common use case. Nice to have it.
> > > > > >
> > > > > > Thanks
> > > > > > Amol
> > > > > >
> > > > > >
> > > > > > On Thu, Mar 17, 2016 at 1:49 AM, Sandeep Deshmukh
> > > > > > <sandeep@datatorrent.com> wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > Many people face issues while copying data from S3 at large
> > > > > > > scale. This module is a great contribution that can be readily
> > > > > > > used with simple configuration.
> > > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > > Sandeep
> > > > > > >
> > > > > > > On Thu, Mar 17, 2016 at 2:04 PM, Priyanka Gugale
> > > > > > > <priyanka@datatorrent.com> wrote:
> > > > > > >
> > > > > > > > It's a good idea to extract the common code into a parent
> > > > > > > > class.
> > > > > > > >
> > > > > > > > +1 for this feature.
> > > > > > > >
> > > > > > > > -Priyanka
> > > > > > > >
> > > > > > > > On Thu, Mar 17, 2016 at 1:57 PM, Chaitanya Chebolu
> > > > > > > > <chaitanya@datatorrent.com> wrote:
> > > > > > > >
> > > > > > > > > Dear Community,
> > > > > > > > >
> > > > > > > > >   I am proposing an S3 Input Module. The primary
> > > > > > > > > functionality of this module is to read files from an S3
> > > > > > > > > bucket in parallel.
> > > > > > > > >
> > > > > > > > >   Below is the JIRA created for this task:
> > > > > > > > > https://issues.apache.org/jira/browse/APEXMALHAR-2019
> > > > > > > > >
> > > > > > > > >   The design of this module is similar to the HDFS input
> > > > > > > > > module, so I will extend the HDFS input module for the S3
> > > > > > > > > module.
> > > > > > > > >
> > > > > > > > >   Instead of extending the HDFS input module, I will create
> > > > > > > > > a common class for all such file system modules. The JIRA for
> > > > > > > > > creating the common class is here:
> > > > > > > > > https://issues.apache.org/jira/browse/APEXMALHAR-2018
> > > > > > > > >
> > > > > > > > >  Please share your thoughts on this.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Chaitanya
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Pradeep A. Dalvi
> > > > >
> > > > > Software Engineer
> > > > > DataTorrent (India)
> > > > >
> > > >
> > >
> >
>
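
The three read modes discussed in this thread can be illustrated with a
small sketch. It is not the module's actual scheduling logic: the class
S3ReadPlanSketch, its Block type, and the split/assign methods are
hypothetical names made up for illustration, and only the
"sequentialFileRead" flag is taken from the thread. The sketch assumes files
are cut into fixed-size blocks and handed to a fixed number of block readers,
either one whole file per reader or block by block.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Apex Malhar.
class S3ReadPlanSketch {

    /** One contiguous byte range of a file, fetched by a single block reader. */
    static final class Block {
        final String file;
        final long offset;
        final long length;

        Block(String file, long offset, long length) {
            this.file = file;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return file + "@" + offset + "+" + length;
        }
    }

    /** Cuts a file of the given size into blocks of at most blockSize bytes. */
    static List<Block> split(String file, long size, long blockSize) {
        List<Block> blocks = new ArrayList<>();
        for (long off = 0; off < size; off += blockSize) {
            blocks.add(new Block(file, off, Math.min(blockSize, size - off)));
        }
        return blocks;
    }

    /**
     * Distributes blocks over readers. With sequentialFileRead set, every block
     * of a file goes to the same reader (mode 2 in the thread). Otherwise blocks
     * are dealt round-robin, so one file may be read by several readers (mode 3).
     */
    static List<List<Block>> assign(Map<String, Long> files, long blockSize,
                                    int readers, boolean sequentialFileRead) {
        List<List<Block>> plan = new ArrayList<>();
        for (int i = 0; i < readers; i++) {
            plan.add(new ArrayList<>());
        }
        int fileIndex = 0;
        int blockIndex = 0;
        for (Map.Entry<String, Long> entry : files.entrySet()) {
            for (Block block : split(entry.getKey(), entry.getValue(), blockSize)) {
                int reader = (sequentialFileRead ? fileIndex : blockIndex) % readers;
                plan.get(reader).add(block);
                blockIndex++;
            }
            fileIndex++;
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("a", 250L);
        files.put("b", 100L);
        System.out.println("sequential: " + assign(files, 100L, 2, true));
        System.out.println("parallel:   " + assign(files, 100L, 2, false));
    }
}
```

With a single input file, the round-robin branch corresponds to mode 1 (one
file read by several block readers), which is why it is treated as a special
case of mode 3 in the replies above.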
