From: Hitesh Kapoor
Date: Tue, 4 Oct 2016 12:10:41 +0530
Subject: Re: Fixed Width Record Parser
To:
dev@apex.apache.org

Hi All,

Thank you for the feedback.
I will use univocity for parsing only and will do the type checking/validation manually.
The input schema is similar to that of CSV, so I will create a common base class, Schema, holding the elements shared by the delimited and fixed width schemas; the Delimited and Fixed Width Schema classes will inherit from it.
I will use PojoUtils for constructing the POJO.

Regards,
Hitesh

On Tue, Oct 4, 2016 at 11:59 AM, Shubham Pathak wrote:

> Hi Hitesh,
>
> I agree with Chinmay. -1 for creating our own library.
>
> +1 for using Univocity.
> For the input schema, I suggest we use the same one as used by the Delimited
> Parser. We would need to add fields to accept the padding character,
> startingCharacterPosition and endingCharacterPosition.
>
> To construct the POJOs you may use PojoUtils
> (library/src/main/java/com/datatorrent/lib/util/PojoUtils.java).
>
> Thanks,
> Shubham
>
> On Mon, Oct 3, 2016 at 4:09 AM, Chinmay Kolhatkar wrote:
>
> > Hi Hitesh,
> >
> > In general I'm not in favor of reinventing the wheel: for one, it takes
> > effort to maintain the library; secondly, a self-written library might
> > take longer to mature and become stable for production use.
> >
> > Hence, -1 from me for creating our own library for fixed length parsing.
> >
> > I looked at the libraries that you proposed and want to add one more
> > to the list: jFFP (http://jffp.sourceforge.net/).
> >
> > To me jFFP and univocity look like good options. I'm personally more
> > inclined towards univocity because it seems to be active in development
> > (last commit 4 days ago) and because this library has been used in the
> > Fixed Length File Loader for Enrichment.
> >
> > My overall vote is to use univocity as much as possible, and if any
> > feature that is important to us is missing in univocity, it should be
> > added on top in our operator.
> >
> > Thanks,
> > Chinmay.
> >
> > On Mon, Oct 3, 2016 at 2:12 PM, Hitesh Kapoor wrote:
> >
> > > Hi All,
> > >
> > > Thank you for your feedback.
> > > As per the votes/comments, I will not be going ahead with approach 2,
> > > as it is not clean.
> > >
> > > For approach 1, I have looked at the possibility of using existing
> > > parsing libraries like flatworm, flatpack and univocity. The problems
> > > with using existing libraries are as follows:
> > > 1) These libraries take the input schema in a specific format and are
> > > complicated to use. For example, the most famous library (as per
> > > stackoverflow), flatworm, requires the input schema in XML format
> > > (refer http://flatworm.sourceforge.net/), so we would lose consistency
> > > with existing parsers like CsvParser, where we take input in JSON
> > > format. Besides the loss of consistency, it would be more difficult
> > > for the user to give input in flatworm-specific XML. If we decide to
> > > convert our JSON to flatworm-specific XML, it will involve a lot more
> > > work than writing our own library.
> > > 2) They do only limited type checking. For example, for a Date type
> > > expected to adhere to dd/mm/yyyy, the input 12/13/2000 (month beyond
> > > 12) may still parse as a valid date.
> > > 3) It is difficult to handle Boolean and Date datatypes.
> > > 4) Future scalability may take a hit. For example, if we want to add
> > > more constraints to our parser, like a min value for an integer or a
> > > pattern for a string, it won't be possible to do it with existing
> > > libraries.
> > > 5) Retrieving the values to create a POJO is not user (coder) friendly.
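
The date concern in point 2 is the kind of check an operator can run itself on a field after a library has split the record. A minimal Python sketch, not tied to any of the libraries named in the thread; the function name `parse_ddmmyyyy` is hypothetical, and only the standard library is used:

```python
from datetime import datetime

def parse_ddmmyyyy(text):
    """Return a date for a strict dd/mm/yyyy string, or None if it is invalid."""
    try:
        # strptime rejects out-of-range months and days instead of rolling them over.
        return datetime.strptime(text, "%d/%m/%Y").date()
    except ValueError:
        return None

print(parse_ddmmyyyy("13/12/2000"))  # 2000-12-13
print(parse_ddmmyyyy("12/13/2000"))  # None (month 13 is rejected)
```

A field that comes back as None would be routed to the error port rather than emitted as a parsed value.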
> > >
> > > In my opinion we should write our own library to do the parsing and
> > > validation, as using an existing library will involve more work.
> > > The work involved in coding the library is easy and straightforward.
> > > It will be easier for us to scale and will also make it easy for the
> > > end user to provide the input schema.
> > > We are not going ahead with approach 2 because it is not clean; the
> > > twisting and turning involved in forcing existing libraries to fit
> > > appears even dirtier to me.
> > >
> > > Regards,
> > > Hitesh
> > >
> > > On Thu, Sep 8, 2016 at 1:37 PM, Yogi Devendra
> > > <devendra.vyavahare@gmail.com> wrote:
> > >
> > > > If we specify the order of the fields and the length of each field,
> > > > then start and end can be computed.
> > > > Why do we need the end user to specify the start position for each
> > > > field?
> > > >
> > > > ~ Yogi
> > > >
> > > > On 8 September 2016 at 12:48, Chinmay Kolhatkar
> > > > <chinmay@datatorrent.com> wrote:
> > > >
> > > > > A few points/questions:
> > > > > 1. Agree with Yogi. Approach 2 does not look clean.
> > > > > 2. Do we need "recordwidthlength"?
> > > > > 3. "recordseparator" should be "\n" and not "/n".
> > > > > 4. In general, providing the schema as JSON is tedious from the
> > > > > user's perspective. I suggest we find a simpler format for
> > > > > specifying the schema. For eg. ,,,
> > > > > 5. I suggest we first provide a basic parser to malhar which does
> > > > > only parsing and type checking. Constraints, IMO, are not part of
> > > > > the parsing module, or if needed can be added as a phase 2
> > > > > improvement of this parser.
> > > > > 6. I would suggest using an existing library for parsing. There is
> > > > > no point in reinventing the wheel, and trying to make something
> > > > > robust can be time consuming.
> > > > >
> > > > > -Chinmay.
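
Yogi's observation that start and end positions follow from the field order and lengths can be sketched in a few lines of Python. The function name `field_offsets` and the example widths are hypothetical; offsets here are 0-based with an exclusive end, which is an assumption rather than anything fixed in the thread:

```python
from itertools import accumulate

def field_offsets(widths):
    """Return (start, end) pairs for consecutive fields: 0-based, end exclusive."""
    ends = list(accumulate(widths))   # running total gives each field's end
    starts = [0] + ends[:-1]          # each field starts where the previous one ended
    return list(zip(starts, ends))

# Hypothetical layout: a 4-character id, a 10-character name, a 1-character flag.
print(field_offsets([4, 10, 1]))  # [(0, 4), (4, 14), (14, 15)]
```

With this, the schema would only need an ordered list of fields with a length each; explicit positions remain necessary only if records contain gaps that no field covers.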
> > > > >
> > > > > On Wed, Sep 7, 2016 at 4:33 PM, Yogi Devendra
> > > > > <devendra.vyavahare@gmail.com> wrote:
> > > > >
> > > > > > Approach 2 does not look like a clean solution.
> > > > > >
> > > > > > -1 for Approach 2.
> > > > > >
> > > > > > ~ Yogi
> > > > > >
> > > > > > On 7 September 2016 at 15:25, Hitesh Kapoor
> > > > > > <hitesh@datatorrent.com> wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > An operator for parsing fixed width records has to be
> > > > > > > implemented. This operator shall parse fixed width byte
> > > > > > > arrays/tuples based on a JSON schema and emit the parsed byte
> > > > > > > array on one port, the converted POJO object on another port,
> > > > > > > and the failed byte arrays/tuples on an error port.
> > > > > > >
> > > > > > > The user will provide a JSON schema based on the schema
> > > > > > > definition given below.
> > > > > > >
> > > > > > > {
> > > > > > >   "recordwidthlength": "Integer",
> > > > > > >   "recordseparator": "/n", // blank if there is no record separator; default is a newline character
> > > > > > >   "fields": [
> > > > > > >     {
> > > > > > >       "name": "",
> > > > > > >       "type": "",
> > > > > > >       "startCharNum": "",
> > > > > > >       "endCharNum": "",
> > > > > > >       "constraints": {
> > > > > > >       }
> > > > > > >     },
> > > > > > >     {
> > > > > > >       "name": "adName",
> > > > > > >       "type": "String",
> > > > > > >       "startCharNum": "Integer",
> > > > > > >       "endCharNum": "Integer",
> > > > > > >       "constraints": {
> > > > > > >         "required": "true",
> > > > > > >         "pattern": "[a-z].*[a-z]$"
> > > > > > >       }
> > > > > > >     }
> > > > > > >   ]
> > > > > > > }
> > > > > > >
> > > > > > > Below are the options to implement this operator.
> > > > > > >
> > > > > > > 1) Write a new custom library for parsing fixed width records,
> > > > > > > as existing libraries for the same (e.g. flatworm, jffp etc.)
> > > > > > > do not have a mechanism for constraint checking.
> > > > > > > The challenge in this approach will be to write a robust
> > > > > > > library from scratch to handle all our requirements.
> > > > > > >
> > > > > > > 2) Extend our already written CsvParser to handle fixed width
> > > > > > > records. In this approach we will have to add a delimiter
> > > > > > > character after every field in the record of the incoming tuple.
> > > > > > > The challenge in this approach would be to select a delimiter
> > > > > > > character; if that character appears in the stream, we will
> > > > > > > have to escape it.
> > > > > > > This approach will increase the memory overhead (as extra
> > > > > > > characters are inserted as delimiters) but will be
> > > > > > > comparatively easier to maintain and operate.
> > > > > > >
> > > > > > > Please let me know your thoughts and votes on the above
> > > > > > > approaches.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Hitesh
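
To make the proposal concrete, the core of a schema-driven fixed width parser can be sketched as below. This is an illustration under stated assumptions, not the operator's implementation: it is written in Python rather than as an Apex operator, it returns a dict where the operator would build a POJO, it treats `startCharNum`/`endCharNum` as 1-based inclusive positions (the thread does not specify this), and the names `parse_record`, `SCHEMA_FIELDS` and `adId` are hypothetical.

```python
# Hypothetical schema using the key names from the proposal above.
SCHEMA_FIELDS = [
    {"name": "adId", "type": "Integer", "startCharNum": 1, "endCharNum": 4},
    {"name": "adName", "type": "String", "startCharNum": 5, "endCharNum": 14},
]

# Only two types are handled in this sketch.
CONVERTERS = {"String": str, "Integer": int}

def parse_record(record, fields, padding=" "):
    """Slice one fixed width record into a dict; return None if it does not fit."""
    expected_len = max(f["endCharNum"] for f in fields)
    if len(record) != expected_len:
        return None  # wrong width: would go to the error port
    parsed = {}
    for f in fields:
        # Assumed 1-based inclusive positions map to a 0-based Python slice.
        raw = record[f["startCharNum"] - 1 : f["endCharNum"]].strip(padding)
        try:
            parsed[f["name"]] = CONVERTERS[f["type"]](raw)
        except (KeyError, ValueError):
            return None  # unknown type or failed conversion
    return parsed

print(parse_record("0042banner    ", SCHEMA_FIELDS))  # {'adId': 42, 'adName': 'banner'}
print(parse_record("00x2banner    ", SCHEMA_FIELDS))  # None
```

Constraint checks such as `required` or `pattern` would sit after the type conversion, which keeps them separable from the slicing step regardless of which library does the slicing.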