apex-dev mailing list archives

From Yogi Devendra <devendra.vyavah...@gmail.com>
Subject Re: Proposal for concrete operator for writing to HDFS file
Date Sun, 06 Mar 2016 08:29:11 GMT
Ram,

The aim of this concrete operator is to write incoming tuples to HDFS files.

The main use case: data is read from some source, processed
tuple-by-tuple by some operators, and then given to this proposed concrete
operator for writing to HDFS.

As you pointed out, file operations are another common use case, but we can
work out a separate mechanism that handles the complexities explained in
your post.
Priyanka has already posted a proposal for an HDFS input module consisting of
FileSplitter + BlockReader operators.
I will post another proposal for an HDFS file copy module that would
integrate seamlessly with the HDFS input module to solve the file copy use case.

Question:
Is it acceptable if we have a concrete operator (the current proposal) for
tuple-by-tuple writing and a separate module to take care of file copy
use cases?

~ Yogi

On 6 March 2016 at 09:45, Munagala Ramanath <ram@datatorrent.com> wrote:

> Since the AbstractFileInputOperator provides a concrete implementation
> (FileLineInputOperator in the same file)
> it seems reasonable to have one for the output operator as well.
>
> Another basic and reasonable requirement is that it should be possible to
> connect the input and output operators
> without any further fussing and get a robust and high-performance
> application for copying files from source to
> destination. There are a number of issues that crop up in doing this
> though: The input operator can read and
> dispatch tuples from multiple files in the same window; how does it tell
> the output operator where the file
> boundaries are? Special control tuples sent inline are one possibility;
> control tuples sent via a separate port
> are another. Tagging each tuple with the file name is a third. Each has
> additional aspects to consider
> such as impact on performance, time skew between multiple input ports, etc.
>
> Ram
>
> On Thu, Mar 3, 2016 at 5:51 PM, Yogi Devendra <yogidevendra@apache.org>
> wrote:
>
> > Any suggestions/ comments on this?
> >
> > ~ Yogi
> >
> > On 3 March 2016 at 17:44, Yogi Devendra <yogidevendra@apache.org> wrote:
> >
> > > Hi,
> > >
> > > Currently, for writing to HDFS files we have AbstractFileOutputOperator in
> > > the Malhar library.
> > >
> > > It has the following abstract methods:
> > > 1. protected abstract String getFileName(INPUT tuple)
> > > 2. protected abstract byte[] getBytesForTuple(INPUT tuple)
> > >
> > > These methods are kept generic to give flexibility to app developers.
> > > But someone who is new to Apex would look for a ready-made implementation
> > > instead of extending the abstract implementation.
> > >
> > > Thus, I am proposing to add a concrete operator, HDFSOutputOperator, to
> > > Malhar. The aim of this operator would be to serve as a ready-to-use
> > > operator for the most frequent use cases.
> > >
> > > Here are my key observations on the most frequent use cases:
> > > -------------------------------------------------------------
> > >
> > > 1. Writing tuples of type byte[] or String.
> > > 2. All tuples on a particular stream end up in the same output file.
> > > 3. The app developer may want to add a custom tuple separator (e.g. a
> > > newline character) between tuples.
> > >
> > > Please share your comments on the following:
> > > --------------------------------------------------------
> > >
> > > 1. Will it be useful to have such a concrete operator?
> > >
> > > 2. Can you think of any datatype other than byte[] and String that
> > > should be supported out of the box by this concrete operator?
> > > Currently, I am planning to accept byte[], String, and any other type
> > > with a valid toString() as input tuples.
> > >
> > > 3. Do you think tuple separator should be configurable?
> > >
> > > 4. Any other feedback?
> > >
> > >
> > > Proposed design:
> > > ----------------------
> > >
> > > 1. This concrete implementation will extend
> > > AbstractFileOutputOperator with default implementations of the abstract
> > > methods mentioned above.
> > >
> > > 2. The file name and tuple separator will be exposed as operator properties.
> > >
> > > 3. All incoming tuples will be written to the single file named by the
> > > file name property.
> > >
> > > 4. This operator will be added to the Malhar library under the package
> > > com.datatorrent.lib.io.fs, where AbstractFileOutputOperator resides.
> > >
> > > ~ Yogi
> > >
> >
>
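
The proposed design quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions: `FileOutputSketch` is a hypothetical stand-in that mirrors the two abstract methods named in the proposal so the example compiles on its own; it is not the real Malhar `AbstractFileOutputOperator`. The class name `SingleFileOutputSketch`, the property names and defaults, and the choice of UTF-8 are likewise assumptions, not part of the proposal.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in mirroring the two abstract methods quoted in the thread.
// The real Malhar base class has more responsibilities that are omitted here.
abstract class FileOutputSketch<INPUT> {
    protected abstract String getFileName(INPUT tuple);
    protected abstract byte[] getBytesForTuple(INPUT tuple);
}

// Sketch of the proposed concrete operator: one output file for every tuple,
// with a configurable separator appended after each tuple.
class SingleFileOutputSketch extends FileOutputSketch<Object> {
    private String fileName = "output.txt";  // assumed default, for illustration
    private String tupleSeparator = "\n";    // assumed default, for illustration

    public void setFileName(String fileName) {
        this.fileName = fileName;
    }

    public void setTupleSeparator(String tupleSeparator) {
        this.tupleSeparator = tupleSeparator;
    }

    // Every tuple maps to the same configured file name.
    @Override
    protected String getFileName(Object tuple) {
        return fileName;
    }

    // byte[] tuples are written as-is; any other type goes through toString().
    @Override
    protected byte[] getBytesForTuple(Object tuple) {
        byte[] payload = (tuple instanceof byte[])
                ? (byte[]) tuple
                : tuple.toString().getBytes(StandardCharsets.UTF_8);
        byte[] separator = tupleSeparator.getBytes(StandardCharsets.UTF_8);

        // Concatenate payload and separator into one array.
        byte[] result = new byte[payload.length + separator.length];
        System.arraycopy(payload, 0, result, 0, payload.length);
        System.arraycopy(separator, 0, result, payload.length, separator.length);
        return result;
    }
}
```

Because the file name ignores the tuple, this sketch covers only the tuple-by-tuple case; it does nothing about the file-boundary question raised in Ram's reply.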
