Date: Sun, 6 Mar 2016 08:53:54 -0800
Message-ID:
Subject: Re: Proposal for concrete operator for writing to HDFS file
From: Munagala Ramanath
To: dev@apex.incubator.apache.org
Yogi,

I think I understand the intent. However, in:

"Main use-case being : data is read from some source, processed
tuple-by-tuple by some operators and then given to this proposed concrete
operator for writing to HDFS."

Does "from some source" specifically exclude files? If so, we should
explicitly state this. In my view, we should make the operator as flexible
as reasonably possible without limiting it to particular "use cases".

Consider the expected typical scenario: an upstream operator X sends tuples
to this proposed operator Y.

1. How does Y know what the file name is, given a tuple (i.e. the
implementation of *getFileName()*)?
2. How does Y know when to call *requestFinalize()* for a file (multiple
files could be in progress)?
3. Is it partitionable? The base class is not, for some reason, though the
file input operator is.
4. The directory where files are written is a fixed property in the base
class, annotated with *@NotNull*; what if this path is not known upfront
but is dynamically constructed on a per-file basis? How does X send this
info to Y?

When looking at files, the simplest example a user will think of is file
copy, so I think we should make that work, and work well. To do that, the
file input operator may also need to be carefully examined and changed if
necessary. Addressing it in a module is certainly an option, but having
file input and output operators with elaborate features, class hierarchies,
and tutorials, where the simplest possible use case is not easy, is doing
users a disservice.

Ram

On Sun, Mar 6, 2016 at 12:29 AM, Yogi Devendra wrote:

> Ram,
>
> The aim of this concrete operator is to write incoming tuples to HDFS
> files.
>
> Main use-case being: data is read from some source, processed
> tuple-by-tuple by some operators, and then given to this proposed concrete
> operator for writing to HDFS.
>
> As you pointed out, file operation is another common use-case, but we can
> work out a separate mechanism which handles the complexities explained in
> your post.
> Priyanka has already posted a proposal for an HDFS input module having
> FileSplitter + BlockReader operators.
> I will post another proposal for an HDFS file copy module which would
> seamlessly integrate with the HDFS input module to solve the file copy
> use-case.
>
> Question:
> Is it acceptable if we have a concrete operator (current proposal) for
> tuple-by-tuple writing and a separate module to take care of file copy
> use-cases?
>
> ~ Yogi
>
> On 6 March 2016 at 09:45, Munagala Ramanath wrote:
>
> > Since the AbstractFileInputOperator provides a concrete implementation
> > (FileLineInputOperator in the same file), it seems reasonable to have
> > one for the output operator as well.
> >
> > Another basic and reasonable requirement is that it should be possible
> > to connect the input and output operators without any further fussing
> > and get a robust, high-performance application for copying files from
> > source to destination. There are a number of issues that crop up in
> > doing this, though: the input operator can read and dispatch tuples
> > from multiple files in the same window; how does it tell the output
> > operator where the file boundaries are? Special control tuples sent
> > inline are one possibility; control tuples sent via a separate port are
> > another. Tagging each tuple with the file name is a third. Each has
> > additional aspects to consider, such as impact on performance, time
> > skew between multiple input ports, etc.
> >
> > Ram
> >
> > On Thu, Mar 3, 2016 at 5:51 PM, Yogi Devendra wrote:
> >
> > > Any suggestions/comments on this?
> > >
> > > ~ Yogi
> > >
> > > On 3 March 2016 at 17:44, Yogi Devendra wrote:
> > >
> > > > Hi,
> > > >
> > > > Currently, for writing to HDFS files, we have
> > > > AbstractFileOutputOperator in the malhar library.
> > > >
> > > > It has the following abstract methods:
> > > > 1. protected abstract String getFileName(INPUT tuple)
> > > > 2. protected abstract byte[] getBytesForTuple(INPUT tuple)
> > > >
> > > > These methods are kept generic to give flexibility to app
> > > > developers. But someone who is new to Apex would look for a
> > > > ready-made implementation instead of extending the abstract one.
> > > >
> > > > Thus, I am proposing to add a concrete operator, HDFSOutputOperator,
> > > > to malhar. The aim of this operator would be to serve as a
> > > > ready-to-use operator for the most frequent use-cases.
> > > >
> > > > Here are my key observations on the most frequent use-cases:
> > > > --------------------------------------------------------------
> > > > 1. Writing tuples of type byte[] or String.
> > > > 2. All tuples on a particular stream land up in the same output
> > > > file.
> > > > 3. The app developer may want to add a custom tuple separator (e.g.
> > > > a newline character) between tuples.
> > > >
> > > > Please mention your comments regarding:
> > > > --------------------------------------------------------
> > > > 1. Will it be useful to have such a concrete operator?
> > > >
> > > > 2. Is there any datatype other than byte[] or String that should be
> > > > supported out of the box by this concrete operator? Currently, I am
> > > > planning to accept byte[], String, and any other type having a
> > > > valid toString() as input tuples.
> > > >
> > > > 3. Do you think the tuple separator should be configurable?
> > > >
> > > > 4. Any other feedback?
> > > >
> > > > Proposed design:
> > > > ----------------------
> > > > 1. This concrete implementation will extend
> > > > AbstractFileOutputOperator with default implementations for the
> > > > abstract methods mentioned above.
> > > >
> > > > 2. The filename and the tuple separator will be exposed as operator
> > > > properties.
> > > >
> > > > 3. All incoming tuples will be written to the same file mentioned in
> > > > the property.
> > > >
> > > > 4. This operator will be added to the malhar library under the
> > > > package com.datatorrent.lib.io.fs, where AbstractFileOutputOperator
> > > > resides.
> > > >
> > > > ~ Yogi
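
The proposed design above can be sketched in code. This is a minimal,
hypothetical illustration, not the final operator: the class and property
names (HDFSOutputOperator, fileName, tupleSeparator) follow the proposal but
are not fixed, and a small abstract shim stands in for Malhar's real
com.datatorrent.lib.io.fs.AbstractFileOutputOperator (which additionally
handles rolling files, partial-window recovery, etc.) so the snippet is
self-contained.

```java
import java.nio.charset.StandardCharsets;

// Stand-in for Malhar's AbstractFileOutputOperator, declaring only the two
// abstract methods discussed in the thread. The real base class does the
// actual HDFS writing, finalization, and fault tolerance.
abstract class FileOutputShim<INPUT> {
    protected abstract String getFileName(INPUT tuple);
    protected abstract byte[] getBytesForTuple(INPUT tuple);
}

// Sketch of the proposed concrete operator: every tuple on the stream goes
// to one configured file (observation 2), tuples are serialized via
// toString() (observation 1 / question 2), and a configurable separator is
// appended between tuples (observation 3 / question 3).
class HDFSOutputOperator<INPUT> extends FileOutputShim<INPUT> {
    private String fileName = "output.txt";   // hypothetical default
    private String tupleSeparator = "\n";     // hypothetical default

    @Override
    protected String getFileName(INPUT tuple) {
        // Ignores the tuple: all tuples land in the same file.
        return fileName;
    }

    @Override
    protected byte[] getBytesForTuple(INPUT tuple) {
        return (tuple.toString() + tupleSeparator)
                .getBytes(StandardCharsets.UTF_8);
    }

    public void setFileName(String fileName) { this.fileName = fileName; }
    public void setTupleSeparator(String sep) { this.tupleSeparator = sep; }
}

public class Main {
    public static void main(String[] args) {
        HDFSOutputOperator<String> op = new HDFSOutputOperator<>();
        op.setTupleSeparator("|");
        System.out.println(op.getFileName("hello"));
        System.out.println(new String(op.getBytesForTuple("hello"),
                StandardCharsets.UTF_8));
    }
}
```

Note that this sketch also makes Ram's questions concrete: because
getFileName() here returns a fixed property, per-tuple or dynamically
constructed file names (his points 1 and 4) are out of scope for this
operator and would need the separate file-copy module Yogi mentions.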