beam-commits mailing list archives

From "Eugene Kirpichov (JIRA)" <>
Subject [jira] [Commented] (BEAM-2826) Need to generate a single XML file when write is performed on small amount of data
Date Wed, 30 Aug 2017 18:13:01 GMT


Eugene Kirpichov commented on BEAM-2826:

Fixing this would mean either giving XmlIO.write() builder methods like those on TextIO
and AvroIO (controlling sharding, and potentially also windowed writes and dynamic destinations),
or finding a good way to do this generally for all file-based sinks. I'm not sure the
WriteFiles transform is in good enough shape to be used like that.

I suppose we can start by adding sharding controls to XmlIO.write() - that'd be an easy
starter task.
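
For illustration only, here is a minimal Python sketch of what a builder-style sharding
control could look like. XmlWriteSpec, with_num_shards, without_sharding and shard_names are
hypothetical names made up for this sketch and are not Beam's XmlIO API; the file naming
pattern is also only an assumption.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class XmlWriteSpec:
    """Hypothetical write configuration; NOT Beam's actual XmlIO API."""

    prefix: str
    num_shards: int = 0  # 0 means "let the runner decide"

    def with_num_shards(self, n: int) -> "XmlWriteSpec":
        # Return a modified copy, builder style, instead of mutating in place.
        if n < 0:
            raise ValueError("num_shards must be non-negative")
        return replace(self, num_shards=n)

    def without_sharding(self) -> "XmlWriteSpec":
        # Force exactly one output file.
        return self.with_num_shards(1)

    def shard_names(self, runner_choice: int) -> List[str]:
        # Use the fixed shard count if one was set, else the runner's choice.
        total = self.num_shards or runner_choice
        return [f"{self.prefix}-{i:05d}-of-{total:05d}.xml" for i in range(total)]


spec = XmlWriteSpec("out")
print(spec.shard_names(runner_choice=3))                     # runner picked 3 files
print(spec.without_sharding().shard_names(runner_choice=3))  # pinned to 1 file
```

With no shard count set, the number of files follows whatever the runner picked; once it is
pinned to 1, the runner's choice no longer matters.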

> Need to generate a single XML file when write is performed on small amount of data
> ----------------------------------------------------------------------------------
>                 Key: BEAM-2826
>                 URL:
>             Project: Beam
>          Issue Type: New Feature
>          Components: beam-model
>    Affects Versions: 2.0.0
>            Reporter: Balajee Venkatesh
>            Assignee: Kenneth Knowles
> I'm trying to write an XML file where the source is a text file stored in GCS. The code
> runs fine, but instead of a single XML file it generates multiple XML files. (The number
> of XML files seems to follow the total number of records in the source text file.) I
> observed this while using 'DataflowRunner'.
> When I run the same code locally, two files are generated. The first contains all
> the records with proper elements, and the second contains only the opening and closing
> root tags.
> As I learnt, it is expected that it may produce multiple files: e.g. if the runner chooses
> to process your data by parallelizing it into 3 tasks ("bundles"), you'll get 3 files. Some
> of the parts may turn out empty in some cases, but the total data written will always add
> up to the expected data.
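
The bundle behaviour described in the issue can be mimicked with a small stand-alone Python
simulation. It does not use Beam: split_into_bundles stands in for the runner's choice of
bundles, write_bundle stands in for a file-based XML sink, and the tag names are made up
for the sketch.

```python
import xml.etree.ElementTree as ET
from typing import List


def split_into_bundles(records: List[str], num_bundles: int) -> List[List[str]]:
    """Cut records into contiguous chunks; a stand-in for the runner's choice."""
    size = -(-len(records) // num_bundles) if records else 0  # ceiling division
    return [records[i * size:(i + 1) * size] for i in range(num_bundles)]


def write_bundle(bundle: List[str], root_tag: str = "records",
                 record_tag: str = "record") -> str:
    """Render one bundle as one XML document (one 'file' per bundle)."""
    root = ET.Element(root_tag)
    for value in bundle:
        ET.SubElement(root, record_tag).text = value
    # short_empty_elements=False keeps explicit opening and closing root tags.
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)


records = ["a", "b"]
files = [write_bundle(b) for b in split_into_bundles(records, 3)]
for content in files:
    print(content)
```

Two records spread over three bundles give three files, one of which holds only the root
tags, while the record elements across all files still add up to the input. Asking for a
single bundle gives a single file.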

This message was sent by Atlassian JIRA
