beam-commits mailing list archives

From "Ryan Skraba (JIRA)" <>
Subject [jira] [Commented] (BEAM-2802) TextIO should allow specifying a custom delimiter
Date Tue, 29 Aug 2017 10:13:00 GMT


Ryan Skraba commented on BEAM-2802:

Hello!  Just to chime in -- I'm part of the team that asked Étienne to investigate this.
 We have some experience with data formats used by customers to contain tabular data in text

I couldn't pass judgement on whether they're "good ideas", just that this is a valid use case!
 There's _probably_ a satisfactory standard for CSV-like data somewhere, but it's definitely
not universal.  In any case, a lot of CSV formats just aren't appropriate for splitting or
big data (looking at you, RFC-4180).

The crux of the problem is having newlines inside the record value (or, more generally, having
the record delimiter inside the field).  We've encountered solutions like using {{\000}} for
record delimiters, or control characters outside of ASCII data (like the {{^B}} above used
to distinguish *real* newline record delimiters from newlines in the record).  We've encountered
record separators like {{\n\-\-\n}} to separate records on different lines, or just {{\-\-}}
for a stream of whitespace-free data.  All of these are human-readable without much difficulty,
and (unfortunately) easy enough to have been invented and implemented in existing tools and

I've mentioned tabular and CSV-like data -- we're only interested in having TextIO extract
the record correctly here.  Splitting the record into fields can and should occur downstream
in a ParDo.
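
To illustrate that split of responsibilities, here is a minimal sketch in plain Python (not Beam code). The function names {{split_records}} and {{split_fields}} and the sample data are hypothetical; the first function stands in for what the reader would do with a custom record delimiter, the second for the downstream step.

```python
from typing import List


def split_records(text: str, delimiter: str) -> List[str]:
    """Split text into records on a fixed delimiter.

    A trailing empty record (text ending with the delimiter) is dropped.
    """
    records = text.split(delimiter)
    if records and records[-1] == "":
        records.pop()
    return records


def split_fields(record: str, field_separator: str = ",") -> List[str]:
    """Downstream step: split one record into fields."""
    return record.split(field_separator)


# Hypothetical sample: records separated by "\n--\n", with a newline
# inside the second field of the first record.
data = "1,first line\nsecond line\n--\n2,single line\n--\n"

records = split_records(data, "\n--\n")
rows = [split_fields(record) for record in records]
```

The embedded newline stays inside the first record because only the full {{\n\-\-\n}} sequence ends a record; the field splitting never needs to know which record delimiter was used.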

All of the existing features of TextIO (such as compression, watching, dynamic destinations)
apply to text files that use a custom delimiter, so it seems like a natural place to add this
functionality.  Custom record delimiters are a common option in unix command line tools, as
well as configurable in the Hadoop TextInputFormat so it shouldn't be unexpected or confusing
for the user to have this option in TextIO.
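
As a rough sketch of what such an option amounts to, the following plain-Python generator reads a byte stream in chunks and yields records separated by a fixed, possibly multi-byte delimiter. It is not the TextIO or Hadoop implementation; the name {{read_records}} and its parameters are assumptions made for illustration.

```python
import io
from typing import BinaryIO, Iterator


def read_records(stream: BinaryIO, delimiter: bytes,
                 chunk_size: int = 8192) -> Iterator[bytes]:
    """Yield records from stream, separated by a fixed byte delimiter."""
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Every piece before the last one is a complete record. The last
        # piece may be a partial record or end in a partial delimiter, so
        # it is kept in the buffer until more data arrives.
        *complete, buffer = buffer.split(delimiter)
        yield from complete
    # Emit whatever remains after the final delimiter, if anything.
    if buffer:
        yield buffer


# A tiny chunk size forces the delimiter to straddle chunk boundaries.
sample = io.BytesIO(b"1,a\nb\n--\n2,c\n--\n")
result = list(read_records(sample, b"\n--\n", chunk_size=3))
```

Keeping the unmatched tail in the buffer is what lets a delimiter that spans two chunks still be recognised. This sketch holds a whole record in memory and ignores file splitting across workers, both of which a real reader would have to handle.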

The performance impact should be negligible with Étienne's proposal above.  I would doubt
that there would be measurable impact if you aren't using a custom delimiter (although this
can be checked).

**TL;DR:** These formats are found "in the wild", and a fixed, multi-byte custom delimiter
is probably the single best step toward connecting a lot of these formats into a Beam job.

> TextIO should allow specifying a custom delimiter
> -------------------------------------------------
>                 Key: BEAM-2802
>                 URL:
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>            Priority: Minor
> Currently TextIO uses {{\r}}, {{\n}}, {{\r\n}}, or a mix of these to split a text file
into PCollection elements. It might happen that a record is spread across more than one line.
In that case we should be able to specify a custom record delimiter to be used in place of
the default ones.

This message was sent by Atlassian JIRA
