Date: Mon, 7 Mar 2016 12:10:40 +0000 (UTC)
From: "Yogi Devendra (JIRA)"
To: dev@apex.incubator.apache.org
Subject: [jira] [Commented] (APEXMALHAR-2009) concrete operator for writing to HDFS file

[ https://issues.apache.org/jira/browse/APEXMALHAR-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182939#comment-15182939 ]

Yogi Devendra commented on APEXMALHAR-2009:
-------------------------------------------

[Ram]

Yogi,

I think I understand the intent. However, in:

"Main use-case being : data is read from some source, processed tuple-by-tuple by some operators and then given to this proposed concrete operator for writing to HDFS."

Does "from some source" specifically exclude files? If so, we should explicitly state this.

In my view, we should make the operator as flexible as reasonably possible without limiting it to particular "use cases".

Consider the expected typical scenario, an upstream operator X sends tuples to this proposed operator Y.

1. How does Y know what the file name is, given a tuple (i.e. implementation of *getFileName()*)?
2. How does Y know when to call *requestFinalize()* for a file (multiple files could be in progress)?
3. Is it partitionable? The base class is not for some reason though the file input operator is.
4. The directory where files are written is a fixed property in the base class annotated with *@NotNull*; what if this path is not known upfront but is dynamically constructed on a per-file basis. How does X send this info to Y?
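
To make questions 1, 2 and 4 concrete, here is a minimal sketch of one possible answer: X puts the destination path and an end-of-file flag into each tuple. Everything in it is an assumption for illustration. `FileTuple` and `PathCarryingWriter` are hypothetical names, the writer is an in-memory stand-in that does not extend the real AbstractFileOutputOperator, and partitioning (question 3) is not addressed.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical tuple sent by upstream operator X: it carries its own destination.
final class FileTuple {
  final String path;   // full path chosen by X per file (questions 1 and 4)
  final byte[] data;   // bytes to append to that file
  final boolean last;  // true when X has no more data for this path (question 2)

  FileTuple(String path, byte[] data, boolean last) {
    this.path = path;
    this.data = data;
    this.last = last;
  }
}

// In-memory stand-in for operator Y; NOT the Malhar base class.
class PathCarryingWriter {
  private final Map<String, ByteArrayOutputStream> inProgress = new HashMap<>();
  private final Map<String, byte[]> finalized = new HashMap<>();

  // Question 1: the file name is read from the tuple itself.
  protected String getFileName(FileTuple tuple) {
    return tuple.path;
  }

  void process(FileTuple tuple) {
    String name = getFileName(tuple);
    ByteArrayOutputStream out =
        inProgress.computeIfAbsent(name, k -> new ByteArrayOutputStream());
    out.write(tuple.data, 0, tuple.data.length);
    // Question 2: finalize only when X marks the last tuple for this file.
    if (tuple.last) {
      requestFinalize(name);
    }
  }

  // Stand-in for finalization: move the file from "in progress" to "finalized".
  protected void requestFinalize(String name) {
    ByteArrayOutputStream out = inProgress.remove(name);
    if (out != null) {
      finalized.put(name, out.toByteArray());
    }
  }

  boolean isInProgress(String name) {
    return inProgress.containsKey(name);
  }

  byte[] finalizedBytes(String name) {
    return finalized.get(name);
  }
}
```

With this shape several files can be in progress at once, and each one is finalized independently when its own last tuple arrives. The cost is that X and Y must agree on the tuple type, which is exactly the coupling the questions above are asking about.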

When looking at files, the simplest example a user will think of is file copy, so I think we should make that work, and work well. To do that, the file input operator may also need to be carefully examined and changed suitably if necessary.

I guess addressing it in a module is certainly an option, but having file input and output operators with elaborate features, class hierarchies, and tutorials, where the simplest possible use case is still not easy, is doing users a disservice.

Ram

> concrete operator for writing to HDFS file
> ------------------------------------------
>
>                 Key: APEXMALHAR-2009
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2009
>             Project: Apache Apex Malhar
>          Issue Type: Task
>            Reporter: Yogi Devendra
>            Assignee: Yogi Devendra
>
> Currently, for writing to an HDFS file, we have AbstractFileOutputOperator in the Malhar library.
> It has the following abstract methods:
> 1. protected abstract String getFileName(INPUT tuple)
> 2. protected abstract byte[] getBytesForTuple(INPUT tuple)
> These methods are kept generic to give flexibility to app developers. But someone who is new to Apex would look for a ready-made implementation instead of extending the abstract implementation.
> Thus, I am proposing to add a concrete operator, HDFSOutputOperator, to Malhar. The aim of this operator is to serve as a ready-to-use operator for the most frequent use cases.
> Here are my key observations on the most frequent use cases:
> ------------------------------------------------------------------------------
> 1. Writing tuples of type byte[] or String.
> 2. All tuples on a particular stream end up in the same output file.
> 3. The app developer may want to add a custom tuple separator (e.g. a newline character) between tuples.
> Discussion thread on the mailing list:
> http://mail-archives.apache.org/mod_mbox/apex-dev/201603.mbox/%3CCAHekGF_6KovS4cjYXzCLdU9En0iPaKO%2BBv%3DEJXbrCuhe9%2BtdrA%40mail.gmail.com%3E

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
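
The three observations in the quoted issue description (String or byte[] tuples, one output file per stream, an optional separator) can be sketched as follows. This is a rough illustration under stated assumptions, not the proposed HDFSOutputOperator. `SketchFileOutput` is a hypothetical stand-in that models only the two abstract method signatures quoted in the issue, `SingleFileStringOutput` is a made-up name, and the "file" is an in-memory buffer rather than HDFS.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the abstract base; only the two methods
// quoted in the issue are modeled here.
abstract class SketchFileOutput<INPUT> {
  protected abstract String getFileName(INPUT tuple);

  protected abstract byte[] getBytesForTuple(INPUT tuple);
}

// Concrete sketch: every String tuple goes to one file, followed by a separator.
class SingleFileStringOutput extends SketchFileOutput<String> {
  private final String fileName;   // fixed output file (observation 2)
  private final String separator;  // e.g. "\n" (observation 3)
  // In-memory stand-in for the output file; a real operator would write to HDFS.
  private final ByteArrayOutputStream sink = new ByteArrayOutputStream();

  SingleFileStringOutput(String fileName, String separator) {
    this.fileName = fileName;
    this.separator = separator;
  }

  @Override
  protected String getFileName(String tuple) {
    return fileName; // same file regardless of tuple content
  }

  @Override
  protected byte[] getBytesForTuple(String tuple) {
    return (tuple + separator).getBytes(StandardCharsets.UTF_8);
  }

  void process(String tuple) {
    byte[] bytes = getBytesForTuple(tuple);
    sink.write(bytes, 0, bytes.length);
  }

  byte[] written() {
    return sink.toByteArray();
  }
}
```

For example, processing the tuples "a" and "b" with separator "\n" leaves the bytes of "a\nb\n" in the buffer. A byte[] variant would differ only in `getBytesForTuple`, where the separator bytes are appended to the tuple's bytes.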