beam-commits mailing list archives

From "Daniel Halperin (JIRA)" <>
Subject [jira] [Commented] (BEAM-55) Allow users to compress FileBasedSink output files
Date Thu, 29 Sep 2016 00:08:20 GMT


Daniel Halperin commented on BEAM-55:

Note that there is a good reason this was not originally supported: compressing output files
is generally terrible for downstream processing. Most consumers perform very poorly when
reading compressed files (for example, Dataflow and Google BigQuery are both unable to
parallelize reads from compressed files).

At Google, we strongly discourage compressing whole files and instead prefer, e.g.,
block-compressed formats like Avro, which combine compression with the ability to
seek/split/parallelize reading. AvroIO DOES support compression.
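
To illustrate why block compression keeps files splittable, here is a minimal sketch in Python using only the standard zlib module. It is not Avro's actual container layout and not Beam code; the function names (compress_blocks, read_block) and the in-memory index of (offset, length) pairs are assumptions made up for this example. Each group of records is compressed independently, so a reader can jump straight to any block without decompressing what comes before it, whereas a single compressed stream has to be read from the start.

```python
import zlib


def compress_blocks(lines, block_size):
    """Compress each group of `block_size` lines independently.

    Returns (blob, index), where index[n] is the (offset, length) of
    compressed block n inside blob. Hypothetical layout for illustration.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(lines), block_size):
        raw = "".join(lines[start:start + block_size]).encode("utf-8")
        chunk = zlib.compress(raw)
        index.append((len(blob), len(chunk)))
        blob.extend(chunk)
    return bytes(blob), index


def read_block(blob, index, n):
    """Decompress only block n, without touching any other block."""
    offset, length = index[n]
    raw = zlib.decompress(blob[offset:offset + length])
    return raw.decode("utf-8").splitlines(keepends=True)


if __name__ == "__main__":
    records = [f"record {i}\n" for i in range(10)]
    blob, index = compress_blocks(records, block_size=4)
    # Each block could be handed to a different worker.
    for n in range(len(index)):
        print(n, read_block(blob, index, n))
```

Because every block is self-contained, the blocks can be assigned to separate workers and read in parallel; the cost is a slightly worse compression ratio than compressing the whole file as one stream.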

> Allow users to compress FileBasedSink output files
> --------------------------------------------------
>                 Key: BEAM-55
>                 URL:
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Daniel Halperin
>            Priority: Minor
> FileBasedSink (also TextIO.Write, AvroIO.Write, etc.) does not have an option for compressing
> its output.
> In general, we discourage compression because it limits or blocks scalably reading from
> a file in parallel. However, users may want it -- so we should support the option (with appropriate

