beam-commits mailing list archives

From "Etienne Chauchot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-2802) TextIO should allow specifying a custom delimiter
Date Tue, 29 Aug 2017 08:37:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144944#comment-16144944 ]

Etienne Chauchot commented on BEAM-2802:
----------------------------------------

I agree the {{\002\n}} example seems odd. But IMHO we should let users specify whatever
delimiter they want, and not restrict them, as long as this freedom has a limited performance
and maintenance cost.

For the text file definition: I think that as long as a file does not contain binary data and
is not in a common format like XML or JSON, it can be considered a pure text file, whether or
not it is easy for a human to read and whether or not it uses newline delimiters.

I also think that files with custom delimiters are quite a common user problem (we have
several client use cases, and the other related PR also tends to show the need). I think
support for them merits inclusion in the Beam SDK.

I would be reluctant to ship a specific TextIO to clients, for maintenance reasons.

I could create another IO, but it seems a pity to duplicate the whole structure of an IO when
the change is 15 lines of code in TextIO#findSeparatorBounds(). Maybe I can submit the PR
that updates TextIO, and we can discuss the maintenance, performance and code location
aspects in the PR.
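
To give a rough idea of how small such a change can be, here is a minimal, self-contained sketch of a custom-delimiter search over a byte buffer. It is not the actual TextIO#findSeparatorBounds() code: the class and method names (CustomDelimiterSplitter, findDelimiter, split) are hypothetical, and it assumes the whole input is already in memory, whereas a real reader would also have to handle a delimiter that straddles two buffer reads.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Beam TextIO implementation.
class CustomDelimiterSplitter {

  /** Returns the index of the first occurrence of delimiter at or after from, or -1 if none. */
  static int findDelimiter(byte[] buffer, int from, byte[] delimiter) {
    for (int i = from; i + delimiter.length <= buffer.length; i++) {
      int j = 0;
      while (j < delimiter.length && buffer[i + j] == delimiter[j]) {
        j++;
      }
      if (j == delimiter.length) {
        return i; // full delimiter matched starting at i
      }
    }
    return -1;
  }

  /** Splits buffer into UTF-8 records separated by an arbitrary, non-empty byte delimiter. */
  static List<String> split(byte[] buffer, byte[] delimiter) {
    if (delimiter.length == 0) {
      throw new IllegalArgumentException("delimiter must not be empty");
    }
    List<String> records = new ArrayList<>();
    int start = 0;
    while (start < buffer.length) {
      int end = findDelimiter(buffer, start, delimiter);
      if (end < 0) {
        end = buffer.length; // last record has no trailing delimiter
      }
      records.add(new String(buffer, start, end - start, StandardCharsets.UTF_8));
      start = end + delimiter.length;
    }
    return records;
  }

  public static void main(String[] args) {
    // Two records separated by the \002\n delimiter discussed above;
    // the first record itself contains a newline.
    byte[] data =
        "line one\nline two\002\nsecond record\002\n".getBytes(StandardCharsets.UTF_8);
    byte[] delimiter = {2, '\n'};
    for (String record : split(data, delimiter)) {
      System.out.println("[" + record + "]");
    }
  }
}
```

The naive scan restarts one byte further on after a partial match, which is simple rather than fast; whether that is acceptable is exactly the kind of performance question that could be discussed in the PR.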


> TextIO should allow specifying a custom delimiter
> -------------------------------------------------
>
>                 Key: BEAM-2802
>                 URL: https://issues.apache.org/jira/browse/BEAM-2802
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>            Priority: Minor
>
> Currently TextIO uses {{\r}}, {{\n}}, {{\r\n}} or a mix of them to split a text file
> into PCollection elements. It might happen that a record is spread across more than one line.
> In that case we should be able to specify a custom record delimiter to be used in place of
> the default ones.
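
The problem described in the issue can be illustrated with plain Java strings, without any Beam code. In this sketch the input text, the "|" record delimiter and the names MultiLineRecordDemo and splitOn are all made up for illustration: splitting on line endings cuts each two-line record apart, while splitting on the record delimiter keeps it whole.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of the issue; it does not use any Beam API.
class MultiLineRecordDemo {

  /** Splits text on the given regular expression; trailing empty pieces are dropped. */
  static List<String> splitOn(String text, String delimiterRegex) {
    return Arrays.asList(text.split(delimiterRegex));
  }

  public static void main(String[] args) {
    // Two records, each spread across two lines and terminated by "|".
    String text = "id=1\nname=Ann|id=2\nname=Bob|";

    // Line-ending based splitting, mirroring the default delimiters named above.
    List<String> byLine = splitOn(text, "\r\n|\r|\n");
    // Splitting on the custom record delimiter instead.
    List<String> byRecord = splitOn(text, Pattern.quote("|"));

    System.out.println("split on line endings: " + byLine.size() + " elements");
    System.out.println("split on '|': " + byRecord.size() + " elements");
  }
}
```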



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
