apex-dev mailing list archives

From Munagala Ramanath <...@datatorrent.com>
Subject Re: Proposal for concrete operator for writing to HDFS file
Date Sun, 06 Mar 2016 04:15:02 GMT
Since AbstractFileInputOperator comes with a concrete implementation
(FileLineInputOperator, in the same file), it seems reasonable to have
one for the output operator as well.

Another basic and reasonable requirement is that it should be possible
to connect the input and output operators without any further fussing
and get a robust, high-performance application for copying files from
source to destination. A number of issues crop up in doing this, though.
The input operator can read and dispatch tuples from multiple files in
the same window; how does it tell the output operator where the file
boundaries are? Special control tuples sent inline are one possibility;
control tuples sent via a separate port are another; tagging each tuple
with the file name is a third. Each has additional aspects to consider,
such as impact on performance, time skew between multiple input ports,
etc.
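
To make the third option concrete, here is a minimal sketch in plain
Java, with no Apex or Malhar dependencies. TaggedLine and
TaggedLineCollector are hypothetical names invented for illustration;
the collector only buffers lines in memory per file name, standing in
for whatever an output operator would actually do with them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical tuple type: one line of data plus the name of its source file. */
final class TaggedLine {
    final String fileName;
    final String line;

    TaggedLine(String fileName, String line) {
        this.fileName = fileName;
        this.line = line;
    }
}

/** Hypothetical stand-in for an output operator; groups lines by file name in memory. */
final class TaggedLineCollector {
    // LinkedHashMap keeps files in the order they were first seen.
    private final Map<String, List<String>> linesByFile = new LinkedHashMap<>();

    /** Routes a tuple to its file's buffer using only the tag it carries. */
    void process(TaggedLine tuple) {
        linesByFile.computeIfAbsent(tuple.fileName, k -> new ArrayList<>()).add(tuple.line);
    }

    Map<String, List<String>> linesByFile() {
        return linesByFile;
    }
}

class FileBoundaryDemo {
    public static void main(String[] args) {
        TaggedLineCollector collector = new TaggedLineCollector();
        // Tuples from two files interleaved within the same window.
        collector.process(new TaggedLine("a.txt", "a1"));
        collector.process(new TaggedLine("b.txt", "b1"));
        collector.process(new TaggedLine("a.txt", "a2"));
        System.out.println(collector.linesByFile()); // prints {a.txt=[a1, a2], b.txt=[b1]}
    }
}
```

The tag lets the receiver separate interleaved files without any extra
port, at the cost of carrying the file name on every tuple. It does not
by itself say when a file is complete, so some end-of-file signal would
still be needed.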

Ram

On Thu, Mar 3, 2016 at 5:51 PM, Yogi Devendra <yogidevendra@apache.org>
wrote:

> Any suggestions/ comments on this?
>
> ~ Yogi
>
> On 3 March 2016 at 17:44, Yogi Devendra <yogidevendra@apache.org> wrote:
>
> > Hi,
> >
> > Currently, for writing to an HDFS file we have
> > AbstractFileOutputOperator in the malhar library.
> >
> > It has the following abstract methods:
> > 1. protected abstract String getFileName(INPUT tuple)
> > 2. protected abstract byte[] getBytesForTuple(INPUT tuple)
> >
> > These methods are kept generic to give flexibility to app developers.
> > But someone who is new to Apex would look for a ready-made
> > implementation instead of extending the abstract one.
> >
> > Thus, I am proposing to add a concrete operator, HDFSOutputOperator,
> > to malhar. The aim of this operator would be to serve as a
> > ready-to-use operator for the most frequent use-cases.
> >
> > Here are my key observations on most frequent use-cases:
> > --------------------------------------------------------
> >
> > 1. Writing tuples of type byte[] or String.
> > 2. All tuples on a particular stream land up in the same output file.
> > 3. App developer may want to add some custom tuple separator (e.g.
> > newline character) between tuples.
> >
> > Please mention your comments regarding:
> > --------------------------------------------------------
> >
> > 1. Will it be useful to have such concrete operator?
> >
> > 2. Can you think of any datatype other than byte[] and String that
> > should be supported out of the box by this concrete operator?
> > Currently, I am planning to accept byte[], String, and any other type
> > with a valid toString() as input tuples.
> >
> > 3. Do you think tuple separator should be configurable?
> >
> > 4. Any other feedback?
> >
> >
> > Proposed design:
> > ----------------------
> >
> > 1. This concrete implementation will extend AbstractFileOutputOperator
> > with default implementations of the abstract methods mentioned above.
> >
> > 2. File name and tuple separator will be exposed as operator
> > properties.
> >
> > 3. All incoming tuples will be written to the same file mentioned in
> > the property.
> >
> > 4. This operator will be added to the malhar library under package
> > com.datatorrent.lib.io.fs where AbstractFileOutputOperator resides.
> >
> > ~ Yogi
> >
>
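
The proposed design quoted above could look roughly like the following
sketch. It is not Malhar code. FileOutputSketch is a stand-in that
mirrors only the two abstract method signatures quoted in the proposal,
and SingleFileOutputSketch, its property names, its default values, and
the choice of UTF-8 are all assumptions made for illustration. The
overrides are widened to public only so the sketch can be exercised
directly.

```java
import java.nio.charset.StandardCharsets;

/** Stand-in for the two abstract methods quoted above; NOT the real Malhar class. */
abstract class FileOutputSketch<INPUT> {
    protected abstract String getFileName(INPUT tuple);

    protected abstract byte[] getBytesForTuple(INPUT tuple);
}

/** Hypothetical concrete operator: one fixed output file, configurable separator. */
class SingleFileOutputSketch extends FileOutputSketch<Object> {
    // Assumed property names and defaults; the proposal does not specify them.
    private String fileName = "output.txt";
    private String tupleSeparator = "\n";

    public void setFileName(String fileName) {
        this.fileName = fileName;
    }

    public void setTupleSeparator(String tupleSeparator) {
        this.tupleSeparator = tupleSeparator;
    }

    /** Every tuple goes to the same configured file. */
    @Override
    public String getFileName(Object tuple) {
        return fileName;
    }

    /** byte[] tuples are written as-is; anything else goes through toString(). */
    @Override
    public byte[] getBytesForTuple(Object tuple) {
        byte[] body = (tuple instanceof byte[])
            ? (byte[]) tuple
            : tuple.toString().getBytes(StandardCharsets.UTF_8);
        byte[] separator = tupleSeparator.getBytes(StandardCharsets.UTF_8);

        // Append the separator after the tuple bytes.
        byte[] out = new byte[body.length + separator.length];
        System.arraycopy(body, 0, out, 0, body.length);
        System.arraycopy(separator, 0, out, body.length, separator.length);
        return out;
    }
}

class SingleFileOutputDemo {
    public static void main(String[] args) {
        SingleFileOutputSketch op = new SingleFileOutputSketch();
        op.setFileName("events.log");
        op.setTupleSeparator("|");
        System.out.println(op.getFileName("any tuple")); // prints events.log
        System.out.println(new String(op.getBytesForTuple(42), StandardCharsets.UTF_8)); // prints 42|
    }
}
```

Accepting Object keeps the sketch close to the "byte[], String, or
anything with a valid toString()" idea in the proposal; a real operator
might prefer separate typed input ports instead.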
