Mailing-List: contact dev-help@apex.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@apex.apache.org
MIME-Version: 1.0
In-Reply-To: <CANMVRkkja2O6Kt3MMSL8KoB8h=AVDKjOJste0der3RQWqifjdg@mail.gmail.com>
References: <CA+LJDcUUoZ6wBsO=anWxdfD-bORug2VV_260eFVN6ssQf+qqEQ@mail.gmail.com>
 <CAHekGF8prFmxCap5HkFXqx_AziLVE4TObgmL0TMceVA5UkjJXQ@mail.gmail.com>
 <CABAipVZD61NKeGG3JZ3kXVKXbetoRunq5vixmKD8U=P1pHZo4A@mail.gmail.com>
 <CAHekGF9+50i-gH5jdTfvWJ2fsmwFsdUxA16q-NkKkUG=F80R7w@mail.gmail.com>
 <CA+LJDcXJaXuLTwJQtZGH8FaAqw3bxV42+g7G9+_E5r2HasLsuw@mail.gmail.com>
 <CABAipVZtFkF=5xw-T-w_keY5Sw9jOdevVKOVUyp5ofJO8vzV3Q@mail.gmail.com> <CANMVRkkja2O6Kt3MMSL8KoB8h=AVDKjOJste0der3RQWqifjdg@mail.gmail.com>
From: Hitesh Kapoor <hitesh@datatorrent.com>
Date: Tue, 4 Oct 2016 12:10:41 +0530
Message-ID: <CA+LJDcUrARtsBC6tqZKTxgptmoUJ+-Xfp3SceBvLEoVhcWCQJw@mail.gmail.com>
Subject: Re: Fixed Width Record Parser
To: dev@apex.apache.org
Content-Type: multipart/alternative; boundary=001a1140300ce53685053e0456c2
archived-at: Tue, 04 Oct 2016 06:40:53 -0000

--001a1140300ce53685053e0456c2
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi All,

Thank you for the feedback.
I will use univocity for parsing only and will do the type
checking/validation manually.
The input schema is similar to that of CSV, so I will create a base class
Schema (holding the elements common to the Delimited and Fixed Width
schemas), and the Delimited and Fixed Width schemas will inherit from it.
I will use PojoUtils for constructing the POJO.
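
A minimal sketch of what such a schema hierarchy could look like. All
class and method names here (Schema, DelimitedSchema, FixedWidthSchema,
startOf, endOf, rawValue) are hypothetical and only for illustration; they
are not the actual Malhar classes. The sketch also assumes, as Yogi
suggested earlier in this thread, that start and end positions are derived
from field order and length rather than given by the user:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical base class: holds what both schema flavours share.
abstract class Schema {
    /** One named, typed field; the type is kept as a plain string here. */
    static final class Field {
        final String name;
        final String type;
        Field(String name, String type) { this.name = name; this.type = type; }
    }

    protected final List<Field> fields = new ArrayList<>();

    List<Field> getFields() { return Collections.unmodifiableList(fields); }
}

// Hypothetical delimited (CSV-like) schema: adds only the delimiter.
final class DelimitedSchema extends Schema {
    final char delimiter;
    DelimitedSchema(char delimiter) { this.delimiter = delimiter; }

    DelimitedSchema addField(String name, String type) {
        fields.add(new Field(name, type));
        return this;
    }
}

// Hypothetical fixed width schema: adds a padding character and field lengths.
final class FixedWidthSchema extends Schema {
    final char padding;
    private final List<Integer> lengths = new ArrayList<>();

    FixedWidthSchema(char padding) { this.padding = padding; }

    FixedWidthSchema addField(String name, String type, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive: " + length);
        }
        fields.add(new Field(name, type));
        lengths.add(length);
        return this;
    }

    /** 0-based inclusive start offset, computed from the preceding lengths. */
    int startOf(int index) {
        int start = 0;
        for (int i = 0; i < index; i++) {
            start += lengths.get(i);
        }
        return start;
    }

    /** 0-based exclusive end offset of the field. */
    int endOf(int index) { return startOf(index) + lengths.get(index); }

    /** Total width of one record, i.e. the sum of all field lengths. */
    int recordLength() { return startOf(lengths.size()); }

    /** Cuts one field out of a record and strips trailing padding. */
    String rawValue(String record, int index) {
        if (record.length() < recordLength()) {
            throw new IllegalArgumentException("record shorter than schema width");
        }
        String slice = record.substring(startOf(index), endOf(index));
        int end = slice.length();
        while (end > 0 && slice.charAt(end - 1) == padding) {
            end--;
        }
        return slice.substring(0, end);
    }
}
```

With fields adId (length 4) and adName (length 10), the second field would
start at offset 4 and end at offset 14. Type checking and validation of the
extracted string would then happen in a separate step.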

Regards,
Hitesh


On Tue, Oct 4, 2016 at 11:59 AM, Shubham Pathak <shubham@datatorrent.com>
wrote:

> Hi Hitesh,
>
> I agree with Chinmay. -1 for creating our own library.
>
> +1 for using Univocity.
> For input schema, I suggest we use the same one as used by Delimited
> Parser. We would need to add fields to accept padding character,
> startingCharacterPosition and endingCharacterPosition.
>
> To construct the POJOs you may use PojoUtils
> <https://github.com/apache/apex-malhar/blob/master/library/src/main/java/com/datatorrent/lib/util/PojoUtils.java>
>
> Thanks,
> Shubham
>
>
>
>
>
> On Mon, Oct 3, 2016 at 4:09 AM, Chinmay Kolhatkar <chinmay@datatorrent.com>
> wrote:
>
> > Hi Hitesh,
> >
> > In general I'm not in favor of reinventing the wheel: for one, it takes
> > effort to maintain the library; secondly, a self-written library might
> > take longer to mature and become stable for production use.
> >
> > Hence, -1 from me for creating our own library for fixed length parsing.
> >
> > I saw the libraries that you proposed and want to add one more library to
> > the list - jFFP (http://jffp.sourceforge.net/).
> >
> > To me jFFP and univocity look like good options. I'm personally more
> > inclined towards univocity because it seems to be in active development
> > (last commit 4 days ago) and because this library has been used in the
> > Fixed Length File Loader for Enrichment.
> >
> > My overall vote is to use univocity as much as possible, and if any
> > feature that is important to us is missing in univocity, it should be
> > added on top in our operator.
> >
> > Thanks,
> > Chinmay.
> >
> >
> > On Mon, Oct 3, 2016 at 2:12 PM, Hitesh Kapoor <hitesh@datatorrent.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > Thank you for your feedback.
> > > So as per the votes/comments, I will not be going ahead with approach 2
> > > as it is not clean.
> > >
> > > For approach 1, I have looked at the possibility of using existing
> > > parsing libraries like flatworm, flatpack and univocity. Following are
> > > the problems with using existing libraries:
> > > 1) These libraries take the input schema in a specific format and are
> > > complicated to use.
> > > For example, the most popular library (as per stackoverflow), flatworm,
> > > requires the input schema in XML format (refer
> > > http://flatworm.sourceforge.net/), so we would lose consistency with
> > > existing parsers like CsvParser, where we take input in JSON format.
> > > Besides the loss of consistency, it would be more difficult for the
> > > user to give input in flatworm-specific XML.
> > > If we decide to convert our JSON to flatworm-specific XML, it will
> > > involve a lot more work than writing our own library.
> > > 2) They do only limited type checking. For example, for a Date type
> > > that should adhere to dd/mm/yyyy, a date may parse correctly for input
> > > 12/13/2000 (month is beyond 12).
> > > 3) It is difficult to handle Boolean and Date datatypes.
> > > 4) Future scalability may take a hit. For example, if we want to add
> > > more constraints to our parser, like a min value for an integer or a
> > > pattern for a string, it won't be possible to do it with existing
> > > libraries.
> > > 5) Retrieving the values to create a POJO is not user (coder) friendly.
> > >
> > > In my opinion we should write our own library to do the parsing and
> > > validation, as using an existing library will involve more work.
> > > The work involved in coding the library is easy and straightforward.
> > > It will be easier for us to scale and will also make it easy for the
> > > end user to provide the input schema.
> > > The reason we are not going ahead with approach 2 is that it is not
> > > clean; the twisting and turning involved in forcefully using existing
> > > libraries appears even dirtier to me.
> > >
> > > Regards,
> > > Hitesh
> > >
> > >
> > >
> > > On Thu, Sep 8, 2016 at 1:37 PM, Yogi Devendra <
> > > devendra.vyavahare@gmail.com>
> > > wrote:
> > >
> > > > If we specify the order of the fields and the length of each field,
> > > > then start and end can be computed.
> > > > Why do we need the end user to specify the start position for each
> > > > field?
> > > >
> > > > ~ Yogi
> > > >
> > > > On 8 September 2016 at 12:48, Chinmay Kolhatkar <chinmay@datatorrent.com>
> > > > wrote:
> > > >
> > > > > Few points/questions:
> > > > > 1. Agree with Yogi. Approach 2 does not look clean.
> > > > > 2. Do we need "recordwidthlength"?
> > > > > 3. "recordseparator" should be "\n" and not "/n".
> > > > > 4. In general, providing the schema as JSON is tedious from the
> > > > > user's perspective. I suggest we find a simpler format for
> > > > > specifying the schema, e.g.
> > > > > <name>,<type>,<startPointer>,<fieldLength>
> > > > > 5. I suggest we first provide a basic parser to Malhar which does
> > > > > only parsing and type checking. Constraints, IMO, are not part of
> > > > > the parsing module, or if needed can be added as a phase 2
> > > > > improvement of this parser.
> > > > > 6. I would suggest using some existing library for parsing. There
> > > > > is no point in reinventing the wheel, and trying to make something
> > > > > robust can be time consuming.
> > > > >
> > > > > -Chinmay.
> > > > >
> > > > >
> > > > > On Wed, Sep 7, 2016 at 4:33 PM, Yogi Devendra <
> > > > > devendra.vyavahare@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Approach 2 does not look like a clean solution.
> > > > > >
> > > > > > -1 for Approach 2.
> > > > > >
> > > > > > ~ Yogi
> > > > > >
> > > > > > On 7 September 2016 at 15:25, Hitesh Kapoor <hitesh@datatorrent.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > An operator for parsing fixed width records has to be
> > > > > > > implemented.
> > > > > > > This operator shall be used to parse fixed width byte
> > > > > > > array/tuples based on a JSON schema and emit the parsed byte
> > > > > > > array on one port, the converted POJO object on another port,
> > > > > > > and the failed byte array/tuples on an error port.
> > > > > > >
> > > > > > >
> > > > > > > The user will provide a JSON schema based on the schema
> > > > > > > definition given below.
> > > > > > >
> > > > > > > {
> > > > > > >   "recordwidthlength": "Integer",
> > > > > > >   "recordseparator": "/n", // this would be blank if there is no
> > > > > > >                            // record separator, default - a
> > > > > > >                            // newline character
> > > > > > >   "fields": [
> > > > > > >     {
> > > > > > >       "name": "<Name of the Field>",
> > > > > > >       "type": "<Data Type of Field>",
> > > > > > >       "startCharNum": "<Integer - Starting Character Position>",
> > > > > > >       "endCharNum": "<Integer - End Character Position>",
> > > > > > >       "constraints": {
> > > > > > >       }
> > > > > > >     },
> > > > > > >     {
> > > > > > >       "name": "adName",
> > > > > > >       "type": "String",
> > > > > > >       "startCharNum": "Integer",
> > > > > > >       "endCharNum": "Integer",
> > > > > > >       "constraints": {
> > > > > > >         "required": "true",
> > > > > > >         "pattern": "[a-z].*[a-z]$"
> > > > > > >       }
> > > > > > >     }
> > > > > > >   ]
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > > Below are the options to implement this operator.
> > > > > > >
> > > > > > > 1) Write a new custom library for parsing fixed width records,
> > > > > > > as existing libraries for the same (e.g. flatworm, jFFP etc.)
> > > > > > > do not have a mechanism for constraint checking.
> > > > > > > The challenge in this approach will be to write a robust
> > > > > > > library from scratch to handle all our requirements.
> > > > > > >
> > > > > > > 2) Extend our already written CsvParser to handle fixed width
> > > > > > > records. In this approach we will have to add a delimiter
> > > > > > > character after every field of the record in the incoming
> > > > > > > tuple.
> > > > > > > The challenge in this approach would be to select a delimiter
> > > > > > > character, and if that character appears in the stream we will
> > > > > > > have to escape it.
> > > > > > > This approach will increase the memory overhead (as extra
> > > > > > > characters are inserted as delimiters) but will be
> > > > > > > comparatively easier to maintain and operate.
> > > > > > >
> > > > > > > Please let me know your thoughts and votes on the above
> > > > > > > approaches.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Hitesh
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

--001a1140300ce53685053e0456c2--