apex-dev mailing list archives

From Shubham Pathak <shub...@datatorrent.com>
Subject Re: Fixed Width Record Parser
Date Tue, 04 Oct 2016 06:29:57 GMT
Hi Hitesh,

I agree with Chinmay. -1 for creating our own library.

+1 for using Univocity.
For the input schema, I suggest we reuse the one used by the Delimited
Parser. We would need to add fields for the padding character,
startingCharacterPosition and endingCharacterPosition.
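
As a rough illustration of how those three additions could drive the parsing, here is a minimal sketch in plain Java. The names FixedWidthSketch and Field are hypothetical, and the sketch assumes 1-based, inclusive character positions and a single padding character per record; it is not the univocity or Malhar API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: FixedWidthSketch and Field are hypothetical names,
// not classes from Malhar or univocity.
final class FixedWidthSketch {

    /** A field with 1-based, inclusive start and end character positions (an assumption). */
    static final class Field {
        final String name;
        final int start;
        final int end;

        Field(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }
    }

    /** Slices each field out of the record and strips the padding character. */
    static Map<String, String> parse(String record, char padding, Field... fields) {
        Map<String, String> values = new LinkedHashMap<>();
        for (Field f : fields) {
            if (f.start < 1 || f.end < f.start || f.end > record.length()) {
                throw new IllegalArgumentException("Field " + f.name + " does not fit the record");
            }
            // substring() is 0-based with an exclusive end, hence start - 1.
            String raw = record.substring(f.start - 1, f.end);
            values.put(f.name, stripPadding(raw, padding));
        }
        return values;
    }

    /** Removes leading and trailing padding characters. */
    static String stripPadding(String raw, char padding) {
        int from = 0;
        int to = raw.length();
        while (from < to && raw.charAt(from) == padding) {
            from++;
        }
        while (to > from && raw.charAt(to - 1) == padding) {
            to--;
        }
        return raw.substring(from, to);
    }
}
```

With a record such as "0042Apex______", padding '_', and fields id (1 to 4) and name (5 to 14), the resulting map holds the unpadded strings, which could then be handed to setters for the POJO.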

To construct the POJOs you may use PojoUtils
<https://github.com/apache/apex-malhar/blob/master/library/src/main/java/com/datatorrent/lib/util/PojoUtils.java>

Thanks,
Shubham





On Mon, Oct 3, 2016 at 4:09 AM, Chinmay Kolhatkar <chinmay@datatorrent.com>
wrote:

> Hi Hitesh,
>
> In general I'm not in favor of reinventing the wheel. For one, it takes
> effort to maintain a library; for another, a self-written library might
> take longer to mature and become stable enough for production use.
>
> Hence, -1 from me for creating our own library for fixed length parsing.
>
> I saw the libraries that you proposed and want to add one more library to
> the list - jFFP (http://jffp.sourceforge.net/).
>
> To me, jFFP and univocity both look like good options. I'm personally more
> inclined towards univocity because it seems to be under active development
> (last commit 4 days ago) and because this library has been used in the Fixed
> Length File Loader for Enrichment.
>
> My overall vote is to use univocity as much as possible; if any feature that
> is important to us is missing in univocity, it should be added on top in our
> operator.
>
> Thanks,
> Chinmay.
>
>
> On Mon, Oct 3, 2016 at 2:12 PM, Hitesh Kapoor <hitesh@datatorrent.com>
> wrote:
>
> > Hi All,
> >
> > Thank you for your feedback.
> > So as per the votes/comments, I will not be going ahead with approach 2,
> > as it is not clean.
> >
> > For approach 1, I have looked at the possibility of using existing parsing
> > libraries like flatworm, flatpack and univocity.
> > Following are the problems with using existing libraries:
> > 1) These libraries take the input schema in a specific format and are
> > complicated to use.
> > For example, the most famous library (as per stackoverflow), flatworm,
> > requires the input schema in XML format (refer
> > http://flatworm.sourceforge.net/), so we will lose our consistency with
> > existing parsers like CsvParser, where we take input in JSON format.
> > Besides the loss of consistency, it will be more difficult for the user to
> > provide input in flatworm-specific XML.
> > If we decide to convert our JSON to flatworm-specific XML, it will involve
> > a lot more work than writing our own library.
> > 2) They do only limited type checking. For example, for a Date type that
> > should adhere to dd/mm/yyyy, a date may parse correctly for input
> > 12/13/2000 (month is beyond 12).
> > 3) It is difficult to handle Boolean and Date datatypes.
> > 4) Future scalability may take a hit. For example, if we want to add more
> > constraints to our parser, like a min value for an integer or a pattern
> > for a string, it won't be possible to do it with existing libraries.
> > 5) Retrieving the values to create a POJO is not user (coder) friendly.
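
The date example in point 2 can be reproduced with the JDK alone. The sketch below assumes, without confirmation from the thread, that the lenient behaviour comes from java.text.SimpleDateFormat being used with its default lenient setting; StrictDateSketch and isValid are hypothetical names. Note that in Java patterns the month is written MM (mm means minutes), so the pattern used here is dd/MM/yyyy.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

// Illustrative only: StrictDateSketch is a hypothetical helper.
final class StrictDateSketch {

    /**
     * Returns true if the text parses with the given pattern.
     * With lenient = true (the SimpleDateFormat default), out-of-range
     * values such as month 13 are rolled over instead of rejected.
     */
    static boolean isValid(String text, String pattern, boolean lenient) {
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setLenient(lenient);
        try {
            format.parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }
}
```

Calling isValid("12/13/2000", "dd/MM/yyyy", true) accepts the value, while the same call with lenient set to false rejects it, so strict checking of this kind can sit on top of whichever parsing library is chosen.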
> >
> > In my opinion we should write our own library to do the parsing and
> > validation, as using an existing library will involve more work.
> > The work involved in coding the library is easy and straightforward.
> > It will be easier for us to scale and will also make it easier for the end
> > user to provide the input schema.
> > The reason we are not going ahead with approach 2 is that it is not clean;
> > the twisting and turning involved in using (forcefully using) existing
> > libraries appears even dirtier to me.
> >
> > Regards,
> > Hitesh
> >
> >
> >
> > On Thu, Sep 8, 2016 at 1:37 PM, Yogi Devendra <
> > devendra.vyavahare@gmail.com>
> > wrote:
> >
> > > If we specify the order of the fields and the length of each field, then
> > > start and end can be computed.
> > > Why do we need the end user to specify the start position for each field?
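
The computation described here is a running sum over the field lengths. A minimal sketch, assuming fields are laid out back to back with no gaps and that positions are 1-based and inclusive (FieldOffsetsSketch is a hypothetical name):

```java
// Illustrative only: FieldOffsetsSketch is a hypothetical helper.
final class FieldOffsetsSketch {

    /**
     * Returns one {start, end} pair per field, 1-based and inclusive,
     * for fields that follow each other without gaps.
     */
    static int[][] offsets(int... lengths) {
        int[][] result = new int[lengths.length][2];
        int next = 1; // start position of the next field
        for (int i = 0; i < lengths.length; i++) {
            if (lengths[i] <= 0) {
                throw new IllegalArgumentException("Field length must be positive: " + lengths[i]);
            }
            result[i][0] = next;
            result[i][1] = next + lengths[i] - 1;
            next += lengths[i];
        }
        return result;
    }
}
```

For lengths 4, 10 and 3 this gives the ranges 1 to 4, 5 to 14 and 15 to 17. Explicit start positions would only be needed if a record contains filler regions that are not declared as fields.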
> > >
> > > ~ Yogi
> > >
> > > On 8 September 2016 at 12:48, Chinmay Kolhatkar <chinmay@datatorrent.com>
> > > wrote:
> > >
> > > > A few points/questions:
> > > > 1. Agree with Yogi. Approach 2 does not look clean.
> > > > 2. Do we need "recordwidthlength"?
> > > > 3. "recordseparator" should be "\n" and not "/n".
> > > > 4. In general, providing the schema as JSON is tedious from the user's
> > > > perspective. I suggest we find a simpler format for specifying the
> > > > schema, e.g. <name>,<type>,<startPointer>,<fieldLength>
> > > > 5. I suggest we first provide a basic parser to malhar which does only
> > > > parsing and type checking. Constraints, IMO, are not part of the parsing
> > > > module, or if needed can be added as a phase 2 improvement of this
> > > > parser.
> > > > 6. I would suggest using some existing library for parsing. There is no
> > > > point in reinventing the wheel, and trying to make something robust can
> > > > be time consuming.
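
To make the compact format from point 4 concrete, here is a minimal sketch of reading one <name>,<type>,<startPointer>,<fieldLength> line. CompactSchemaSketch and FieldSpec are hypothetical names, and the sketch assumes that field names and types contain no commas.

```java
// Illustrative only: CompactSchemaSketch and FieldSpec are hypothetical names.
final class CompactSchemaSketch {

    /** One parsed schema line. */
    static final class FieldSpec {
        final String name;
        final String type;
        final int start;
        final int length;

        FieldSpec(String name, String type, int start, int length) {
            this.name = name;
            this.type = type;
            this.start = start;
            this.length = length;
        }
    }

    /** Parses a line of the form name,type,startPointer,fieldLength. */
    static FieldSpec parseLine(String line) {
        // limit -1 keeps trailing empty parts so malformed lines are detected
        String[] parts = line.split(",", -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected 4 comma-separated values: " + line);
        }
        return new FieldSpec(
            parts[0].trim(),
            parts[1].trim(),
            Integer.parseInt(parts[2].trim()),
            Integer.parseInt(parts[3].trim()));
    }
}
```

A line such as "adName,String,1,20" would yield a field named adName of type String starting at position 1 with length 20.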
> > > >
> > > > -Chinmay.
> > > >
> > > >
> > > > On Wed, Sep 7, 2016 at 4:33 PM, Yogi Devendra <
> > > > devendra.vyavahare@gmail.com>
> > > > wrote:
> > > >
> > > > > Approach 2 does not look like a clean solution.
> > > > >
> > > > > -1 for Approach 2.
> > > > >
> > > > > ~ Yogi
> > > > >
> > > > > > On 7 September 2016 at 15:25, Hitesh Kapoor <hitesh@datatorrent.com>
> > > > > > wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > > An operator for parsing fixed width records has to be implemented.
> > > > > > > This operator shall be used to parse fixed width byte array/tuples
> > > > > > > based on a JSON schema and emit the parsed byte array on one port,
> > > > > > > the converted POJO object on another port and the failed byte
> > > > > > > array/tuples on an error port.
> > > > > > >
> > > > > > >
> > > > > > > The user will provide a JSON schema based on the schema definition
> > > > > > > mentioned below.
> > > > > >
> > > > > > {
> > > > > >
> > > > > > “recordwidthlength”: “Integer”
> > > > > >
> > > > > > "recordseparator": "/n", // this would be blank if there is
no
> > record
> > > > > > separator, default - a newline character
> > > > > >
> > > > > > "fields": [
> > > > > >
> > > > > > {
> > > > > >
> > > > > > "name": "<Name of the Field>",
> > > > > >
> > > > > > "type": "<Data Type of Field>",
> > > > > >
> > > > > > “startCharNum”: “<Integer - Starting Character Position>”,
> > > > > >
> > > > > > “endCharNum”: “<Integer - End Character Position>”,
> > > > > >
> > > > > > "constraints": {
> > > > > >
> > > > > > }
> > > > > >
> > > > > > },
> > > > > >
> > > > > > {
> > > > > >
> > > > > > "name": "adName",
> > > > > >
> > > > > > "type": "String",
> > > > > >
> > > > > > “startCharNum”: “Integer”,
> > > > > >
> > > > > > “endCharNum”: “Integer”,
> > > > > >
> > > > > > "constraints": {
> > > > > >
> > > > > > "required": "true",
> > > > > >
> > > > > > "pattern": "[a­z].*[a­z]$",
> > > > > >
> > > > > > }
> > > > > >
> > > > > > }
> > > > > > ]
> > > > > > }
> > > > > >
> > > > > >
> > > > > > Below are the options to implement this operator.
> > > > > >
> > > > > > > 1) Write a new custom library for parsing fixed width records, as
> > > > > > > existing libraries for the same (e.g. flatworm, jffp etc.) do not
> > > > > > > have a mechanism for constraint checking.
> > > > > > > The challenge in this approach will be to write a robust library
> > > > > > > from scratch to handle all our requirements.
> > > > > > >
> > > > > > > 2) Extend our already written CsvParser to handle fixed width
> > > > > > > records. In this approach we will have to add a delimiter
> > > > > > > "character" after every field in the record of the incoming tuple.
> > > > > > > The challenge in this approach would be to select a delimiter
> > > > > > > character and, if that character appears in the stream, to escape
> > > > > > > it.
> > > > > > > This approach will increase the memory overhead (as extra
> > > > > > > characters are inserted as delimiters) but will be comparatively
> > > > > > > easier to maintain and operate.
> > > > > > >
> > > > > > > Please let me know your thoughts and votes on the above approaches.
> > > > > >
> > > > > > Regards,
> > > > > > Hitesh
> > > > > >
> > > > >
> > > >
> > >
> >
>
