Date: Sun, 6 Mar 2016 08:53:54 -0800
Message-ID:
Subject: Re: Proposal for concrete operator for writing to HDFS file
From: Munagala Ramanath
To: dev@apex.incubator.apache.org
Yogi,

I think I understand the intent. However, in:

"Main use-case being : data is read from some source, processed
tuple-by-tuple by some operators and then given to this proposed concrete
operator for writing to HDFS."

Does "from some source" specifically exclude files? If so, we should
explicitly state this. In my view, we should make the operator as flexible
as reasonably possible without limiting it to particular "use cases".

Consider the expected typical scenario: an upstream operator X sends tuples
to this proposed operator Y.

1. How does Y know what the file name is, given a tuple (i.e. the
implementation of *getFileName()*)?
2. How does Y know when to call *requestFinalize()* for a file (multiple
files could be in progress)?
3. Is it partitionable? The base class is not, for some reason, though the
file input operator is.
4. The directory where files are written is a fixed property in the base
class, annotated with *@NotNull*; what if this path is not known upfront
but is dynamically constructed on a per-file basis? How does X send this
info to Y?

When looking at files, the simplest example a user will think of is file
copy, so I think we should make that work, and work well. To do that, the
file input operator may also need to be carefully examined and changed if
necessary. Addressing it in a module is certainly an option, but having
file input and output operators with elaborate features, class hierarchies,
and tutorials, where the simplest possible use case is not easy, is doing
users a disservice.

Ram

On Sun, Mar 6, 2016 at 12:29 AM, Yogi Devendra wrote:

> Ram,
>
> The aim of this concrete operator is to write incoming tuples to HDFS
> files.
>
> Main use-case being: data is read from some source, processed
> tuple-by-tuple by some operators, and then given to this proposed concrete
> operator for writing to HDFS.
>
> As you pointed out, file operation is another common use-case, but we can
> work out a separate mechanism which handles the complexities explained in
> your post.
> Priyanka has already posted a proposal for an HDFS input module having
> FileSplitter + BlockReader operators.
> I will post another proposal for an HDFS file copy module which would
> seamlessly integrate with the HDFS input module to solve the file copy
> use-case.
>
> Question:
> Is it acceptable if we have a concrete operator (current proposal) for
> tuple-by-tuple writing and a separate module to take care of file copy
> use-cases?
>
> ~ Yogi
>
> On 6 March 2016 at 09:45, Munagala Ramanath wrote:
>
> > Since the AbstractFileInputOperator provides a concrete implementation
> > (FileLineInputOperator in the same file), it seems reasonable to have
> > one for the output operator as well.
> >
> > Another basic and reasonable requirement is that it should be possible
> > to connect the input and output operators without any further fussing
> > and get a robust, high-performance application for copying files from
> > source to destination. There are a number of issues that crop up in
> > doing this, though: the input operator can read and dispatch tuples
> > from multiple files in the same window; how does it tell the output
> > operator where the file boundaries are? Special control tuples sent
> > inline are one possibility; control tuples sent via a separate port are
> > another. Tagging each tuple with the file name is a third. Each has
> > additional aspects to consider, such as impact on performance, time
> > skew between multiple input ports, etc.
> >
> > Ram
> >
> > On Thu, Mar 3, 2016 at 5:51 PM, Yogi Devendra wrote:
> >
> > > Any suggestions/comments on this?
> > >
> > > ~ Yogi
> > >
> > > On 3 March 2016 at 17:44, Yogi Devendra wrote:
> > >
> > > > Hi,
> > > >
> > > > Currently, for writing to HDFS files, we have
> > > > AbstractFileOutputOperator in the malhar library.
> > > >
> > > > It has the following abstract methods:
> > > > 1. protected abstract String getFileName(INPUT tuple)
> > > > 2. protected abstract byte[] getBytesForTuple(INPUT tuple)
> > > >
> > > > These methods are kept generic to give flexibility to app
> > > > developers. But someone who is new to Apex would look for a
> > > > ready-made implementation instead of extending the abstract one.
> > > >
> > > > Thus, I am proposing to add a concrete operator, HDFSOutputOperator,
> > > > to malhar. The aim of this operator would be to serve as a
> > > > ready-to-use operator for the most frequent use-cases.
> > > >
> > > > Here are my key observations on the most frequent use-cases:
> > > > --------------------------------------------------------------
> > > > 1. Writing tuples of type byte[] or String.
> > > > 2. All tuples on a particular stream land up in the same output
> > > > file.
> > > > 3. The app developer may want to add a custom tuple separator (e.g.
> > > > a newline character) between tuples.
> > > >
> > > > Please mention your comments regarding:
> > > > --------------------------------------------------------
> > > > 1. Will it be useful to have such a concrete operator?
> > > >
> > > > 2. Is there any datatype other than byte[] or String that should be
> > > > supported out of the box by this concrete operator? Currently, I am
> > > > planning to accept byte[], String, and any other type having a
> > > > valid toString() as input tuples.
> > > >
> > > > 3. Do you think the tuple separator should be configurable?
> > > >
> > > > 4. Any other feedback?
> > > >
> > > > Proposed design:
> > > > ----------------------
> > > > 1. This concrete implementation will extend
> > > > AbstractFileOutputOperator with default implementations for the
> > > > abstract methods mentioned above.
> > > >
> > > > 2. The filename and the tuple separator will be exposed as operator
> > > > properties.
> > > >
> > > > 3. All incoming tuples will be written to the same file mentioned in
> > > > the property.
> > > >
> > > > 4. This operator will be added to the malhar library under the
> > > > package com.datatorrent.lib.io.fs, where AbstractFileOutputOperator
> > > > resides.
> > > >
> > > > ~ Yogi
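
The proposed design above can be sketched in code. This is a minimal,
hypothetical illustration, not the final operator: the class and property
names (HDFSOutputOperator, fileName, tupleSeparator) follow the proposal but
are not fixed, and a small abstract shim stands in for Malhar's real
com.datatorrent.lib.io.fs.AbstractFileOutputOperator (which additionally
handles rolling files, partial-window recovery, etc.) so the snippet is
self-contained.

```java
import java.nio.charset.StandardCharsets;

// Stand-in for Malhar's AbstractFileOutputOperator, declaring only the two
// abstract methods discussed in the thread. The real base class does the
// actual HDFS writing, finalization, and fault tolerance.
abstract class FileOutputShim<INPUT> {
    protected abstract String getFileName(INPUT tuple);
    protected abstract byte[] getBytesForTuple(INPUT tuple);
}

// Sketch of the proposed concrete operator: every tuple on the stream goes
// to one configured file (observation 2), tuples are serialized via
// toString() (observation 1 / question 2), and a configurable separator is
// appended between tuples (observation 3 / question 3).
class HDFSOutputOperator<INPUT> extends FileOutputShim<INPUT> {
    private String fileName = "output.txt";   // hypothetical default
    private String tupleSeparator = "\n";     // hypothetical default

    @Override
    protected String getFileName(INPUT tuple) {
        // Ignores the tuple: all tuples land in the same file.
        return fileName;
    }

    @Override
    protected byte[] getBytesForTuple(INPUT tuple) {
        return (tuple.toString() + tupleSeparator)
                .getBytes(StandardCharsets.UTF_8);
    }

    public void setFileName(String fileName) { this.fileName = fileName; }
    public void setTupleSeparator(String sep) { this.tupleSeparator = sep; }
}

public class Main {
    public static void main(String[] args) {
        HDFSOutputOperator<String> op = new HDFSOutputOperator<>();
        op.setTupleSeparator("|");
        System.out.println(op.getFileName("hello"));
        System.out.println(new String(op.getBytesForTuple("hello"),
                StandardCharsets.UTF_8));
    }
}
```

Note that this sketch also makes Ram's questions concrete: because
getFileName() here returns a fixed property, per-tuple or dynamically
constructed file names (his points 1 and 4) are out of scope for this
operator and would need the separate file-copy module Yogi mentions.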