apex-dev mailing list archives

From "Yogi Devendra (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (APEXMALHAR-2009) concrete operator for writing to HDFS file
Date Mon, 07 Mar 2016 12:10:40 GMT

    [ https://issues.apache.org/jira/browse/APEXMALHAR-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182939#comment-15182939
] 

Yogi Devendra commented on APEXMALHAR-2009:
-------------------------------------------

[Ram]

Yogi, I think I understand the intent. However, in:

"Main use-case being : data is read from some source, processed
tuple-by-tuple by some operators and then given to this proposed concrete
operator for writing to HDFS."

Does "from some source" specifically exclude files? If so, we should
state this explicitly.
In my view, we should make the operator as flexible as reasonably possible
without limiting it to particular "use cases".

Consider the expected typical scenario: an upstream operator X sends tuples
to this proposed operator Y.
1. How does Y know what the file name is, given a tuple (i.e. the
   implementation of *getFileName()*)?
2. How does Y know when to call *requestFinalize()* for a file (multiple
   files could be in progress)?
3. Is it partitionable? For some reason the base class is not, though the
   file input operator is.
4. The directory where files are written is a fixed property in the base
   class annotated with *@NotNull*; what if this path is not known upfront
   but is constructed dynamically on a per-file basis? How does X send this
   info to Y?
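
To make questions 1, 2 and 4 concrete, here is a rough sketch of one way X
could carry that information to Y inside the tuple itself. Everything in it
is hypothetical: FileChunk, ChunkWriter and their members are names made up
for illustration and are not part of Malhar, and in-memory buffers stand in
for HDFS streams so the snippet runs on its own. It is only meant to show
the shape of the idea, not a proposed implementation.

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical envelope emitted by X; not a Malhar class.
final class FileChunk {
  final String relativePath; // directory + file name, built per file by X
  final byte[] data;         // payload to append to that file
  final boolean last;        // true on the final chunk of the file

  FileChunk(String relativePath, byte[] data, boolean last) {
    this.relativePath = relativePath;
    this.data = data;
    this.last = last;
  }
}

// Hypothetical stand-in for Y. Buffers replace real output streams.
final class ChunkWriter {
  private final Map<String, ByteArrayOutputStream> open = new LinkedHashMap<>();
  private final Map<String, byte[]> finalized = new LinkedHashMap<>();

  // Plays the role of getFileName(tuple): the name comes from the tuple.
  String getFileName(FileChunk tuple) {
    return tuple.relativePath;
  }

  void process(FileChunk tuple) {
    String name = getFileName(tuple);
    ByteArrayOutputStream out =
        open.computeIfAbsent(name, k -> new ByteArrayOutputStream());
    out.write(tuple.data, 0, tuple.data.length);
    if (tuple.last) {
      // This is the point where a real operator would finalize the file.
      finalized.put(name, open.remove(name).toByteArray());
    }
  }

  // Files that have received data but no final chunk yet.
  Set<String> inProgress() {
    return open.keySet();
  }

  // Content of a finalized file, or null if it is not finalized.
  byte[] finalizedContent(String name) {
    return finalized.get(name);
  }
}
```

With this shape, several files can be in progress at once and each is
finalized independently when its last chunk arrives. It says nothing about
question 3 (partitioning), which would need chunks of one file to be routed
to the same partition.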

When looking at files, the simplest example a user will think of is file
copy, so I think we should make that work, and work well. To do that, the
file input operator may also need to be carefully examined and changed
suitably if necessary.

Addressing it in a module is certainly an option, but having file input and
output operators with elaborate features, class hierarchies, and tutorials,
while the simplest possible use case remains hard, does users a disservice.

Ram

> concrete operator for writing to HDFS file
> ------------------------------------------
>
>                 Key: APEXMALHAR-2009
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2009
>             Project: Apache Apex Malhar
>          Issue Type: Task
>            Reporter: Yogi Devendra
>            Assignee: Yogi Devendra
>
> Currently, for writing to an HDFS file we have AbstractFileOutputOperator in the Malhar library.
> It has the following abstract methods:
> 1. protected abstract String getFileName(INPUT tuple)
> 2. protected abstract byte[] getBytesForTuple(INPUT tuple)
> These methods are kept generic to give flexibility to app developers. But someone who is new to Apex would look for a ready-made implementation instead of extending the abstract implementation.
> Thus, I am proposing to add a concrete operator, HDFSOutputOperator, to Malhar. The aim of this operator would be to serve as a ready-to-use operator for the most frequent use cases.
> Here are my key observations on the most frequent use cases:
> ------------------------------------------------------------------------------
> 1. Writing tuples of type byte[] or String.
> 2. All tuples on a particular stream end up in the same output file.
> 3. The app developer may want to add a custom tuple separator (e.g. a newline character) between tuples.
> Discussion thread on mailing list here:
> http://mail-archives.apache.org/mod_mbox/apex-dev/201603.mbox/%3CCAHekGF_6KovS4cjYXzCLdU9En0iPaKO%2BBv%3DEJXbrCuhe9%2BtdrA%40mail.gmail.com%3E
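
For reference, the three observations in the description could look roughly
like the sketch below. It is not the proposed HDFSOutputOperator:
FileOutputSketch is a local stand-in that only mirrors the two abstract
method signatures quoted above (it is not the Malhar base class), and
SingleFileStringOutput, its constructor, and its fields are hypothetical
names. It handles String tuples only and does no actual file I/O.

```java
import java.nio.charset.StandardCharsets;

// Local stand-in mirroring the two abstract methods quoted in the issue.
abstract class FileOutputSketch<INPUT> {
  protected abstract String getFileName(INPUT tuple);

  protected abstract byte[] getBytesForTuple(INPUT tuple);
}

// Hypothetical concrete class: one fixed output file, String tuples,
// and a configurable separator appended after each tuple.
final class SingleFileStringOutput extends FileOutputSketch<String> {
  private final String fileName;
  private final String separator;

  SingleFileStringOutput(String fileName, String separator) {
    this.fileName = fileName;
    this.separator = separator;
  }

  // Observation 2: every tuple on the stream goes to the same file.
  @Override
  public String getFileName(String tuple) {
    return fileName;
  }

  // Observations 1 and 3: encode the String and append the separator.
  @Override
  public byte[] getBytesForTuple(String tuple) {
    return (tuple + separator).getBytes(StandardCharsets.UTF_8);
  }
}
```

Constructing it with a file name and "\n" as the separator would give
newline-delimited output in a single file, which is the case the
description treats as most common.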



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
