apex-dev mailing list archives

From Hitesh Kapoor <hit...@datatorrent.com>
Subject Fixed Width Record Parser
Date Wed, 07 Sep 2016 09:55:46 GMT
Hi All,

We need to implement an operator for parsing fixed-width records.
The operator will parse fixed-width byte arrays/tuples based on a JSON
schema and emit the parsed byte array on one port, the converted POJO on
another port, and the failed byte arrays/tuples on an error port.


The user will provide a JSON schema that follows the definition below.

{
  "recordwidthlength": "Integer",
  // recordseparator would be blank if there is no record separator;
  // default - a newline character
  "recordseparator": "\n",
  "fields": [
    {
      "name": "<Name of the Field>",
      "type": "<Data Type of Field>",
      "startCharNum": "<Integer - Starting Character Position>",
      "endCharNum": "<Integer - End Character Position>",
      "constraints": {
      }
    },
    {
      "name": "adName",
      "type": "String",
      "startCharNum": "Integer",
      "endCharNum": "Integer",
      "constraints": {
        "required": "true",
        "pattern": "[a-z].*[a-z]$"
      }
    }
  ]
}
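
To illustrate how a schema of this shape could drive the parsing, here is a
rough Python sketch (not the proposed operator itself). It assumes that
startCharNum and endCharNum are 1-based and inclusive, which the schema
definition above does not state, and that field values are trimmed before
conversion. The function parse_record, the field adId and the sample values
are hypothetical.

```python
import json

# Hypothetical schema instance following the definition above.
# Assumption: character positions are 1-based and inclusive.
SCHEMA = json.loads("""
{
  "recordwidthlength": 12,
  "recordseparator": "\\n",
  "fields": [
    {"name": "adId",   "type": "Integer", "startCharNum": 1, "endCharNum": 4},
    {"name": "adName", "type": "String",  "startCharNum": 5, "endCharNum": 12}
  ]
}
""")

# Only the two types used in this example are mapped; a real operator
# would need the full set of supported data types.
CONVERTERS = {"Integer": int, "String": str}


def parse_record(record, schema):
    """Slice one fixed-width record into a dict of typed field values."""
    if len(record) != schema["recordwidthlength"]:
        raise ValueError("expected %d characters, got %d"
                         % (schema["recordwidthlength"], len(record)))
    parsed = {}
    for field in schema["fields"]:
        # Convert the 1-based inclusive range to a Python slice.
        raw = record[field["startCharNum"] - 1:field["endCharNum"]].strip()
        parsed[field["name"]] = CONVERTERS[field["type"]](raw)
    return parsed


print(parse_record("0042banner  ", SCHEMA))
# prints {'adId': 42, 'adName': 'banner'}
```

A record that raises here would be the kind of tuple that goes to the error
port.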


Below are the options to implement this operator.

1) Write a new custom library for parsing fixed-width records, since the
existing libraries for this (e.g. flatworm, jffp etc.) do not have a
mechanism for constraint checking.
The challenge with this approach is writing a robust library from scratch
that handles all our requirements.
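
As a rough idea of the constraint checking such a library would need, here is
a small Python sketch. The exact semantics are assumptions on my part:
"required" is taken to mean the trimmed value must be non-empty, "pattern" is
taken to be a regular expression searched for within the value, and an empty
optional value skips the pattern check. The function check_constraints is
hypothetical.

```python
import re


def check_constraints(value, constraints):
    """Return a list of violation messages; an empty list means the value passes."""
    violations = []
    # The schema writes booleans as strings ("true"), so compare as text.
    required = str(constraints.get("required", "false")).lower() == "true"
    if value == "":
        if required:
            violations.append("required field is empty")
        return violations  # nothing else to check on an empty value
    pattern = constraints.get("pattern")
    # Assumption: the pattern may match anywhere in the value (re.search).
    if pattern is not None and re.search(pattern, value) is None:
        violations.append("value %r does not match pattern %r" % (value, pattern))
    return violations


ad_name_constraints = {"required": "true", "pattern": "[a-z].*[a-z]$"}
print(check_constraints("banner", ad_name_constraints))  # prints []
print(check_constraints("", ad_name_constraints))
# prints ['required field is empty']
```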

2) Extend our existing CsvParser to handle fixed-width records. In this
approach we would have to insert a delimiter character after every field
of each incoming tuple.
The challenge with this approach is choosing a delimiter character: if that
character also appears in the stream, we will have to escape it.
This approach increases memory overhead (because extra characters are
inserted as delimiters) but will be comparatively easier to maintain and
operate.
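
The delimiter insertion and escaping described in this option could look
roughly like the Python sketch below. It is only an illustration: the "|"
delimiter, the backslash escape character, the (start, end) field layout with
1-based inclusive positions, and the function to_delimited are all
assumptions, not the behaviour of our CsvParser. The delimiter is placed
between fields here rather than after the last one.

```python
def to_delimited(record, fields, delimiter="|", escape="\\"):
    """Convert one fixed-width record into a delimited line.

    fields is a list of (start, end) positions, assumed 1-based and inclusive.
    """
    parts = []
    for start, end in fields:
        raw = record[start - 1:end]
        # Escape the escape character first, then any literal delimiter,
        # so the output can be split back unambiguously.
        raw = raw.replace(escape, escape + escape)
        raw = raw.replace(delimiter, escape + delimiter)
        parts.append(raw)
    return delimiter.join(parts)


record = "0042ad|one  "              # 12 characters, contains a literal "|"
line = to_delimited(record, [(1, 4), (5, 12)])
print(line)                          # prints 0042|ad\|one
print(len(record), len(line))        # prints 12 14
```

The two extra characters in this example (one delimiter, one escape) are the
memory overhead mentioned above; it grows with the number of fields and with
how often the delimiter occurs in the data.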

Please let me know your thoughts and votes on the above approaches.

Regards,
Hitesh
