beam-commits mailing list archives

From "Daniel Halperin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-55) Allow users to compress FileBasedSink output files
Date Thu, 29 Sep 2016 00:08:20 GMT

    [ https://issues.apache.org/jira/browse/BEAM-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15531312#comment-15531312 ]

Daniel Halperin commented on BEAM-55:
-------------------------------------

Note that there is a good reason this was not originally supported: compressing output files
is generally terrible for downstream processing. Most consumers perform very poorly when
reading compressed files (for example, Dataflow and Google BigQuery are both unable to
parallelize reads from compressed files).

At Google, we strongly discourage whole-file compression and instead prefer block-compressed
formats like Avro, which combine compression with the ability to seek, split, and parallelize
reads. AvroIO DOES support compression.
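
To illustrate the splittability point, here is a minimal Python sketch. It is not Beam or Avro
code: it uses the standard zlib module, and the helper names (compress_whole, compress_blocks,
can_decode) and the block size are made up for the example. It assumes a "block-compressed"
file can be modeled as a list of independently compressed chunks; Avro's real container format
is more involved (it adds sync markers between blocks, among other things).

```python
import zlib


def compress_whole(records):
    """Compress all records as ONE stream, like a gzipped text file."""
    return zlib.compress(b"".join(records))


def compress_blocks(records, block_size):
    """Compress each group of `block_size` records as an independent stream."""
    return [
        zlib.compress(b"".join(records[i:i + block_size]))
        for i in range(0, len(records), block_size)
    ]


def can_decode(chunk):
    """Return True if `chunk` decompresses on its own, False otherwise."""
    try:
        zlib.decompress(chunk)
        return True
    except zlib.error:
        return False


records = [f"record-{i}\n".encode() for i in range(1000)]

whole = compress_whole(records)
blocks = compress_blocks(records, block_size=100)

# A worker handed only the second half of the single stream cannot start
# decoding there: the bytes depend on everything that came before them.
mid_stream_readable = can_decode(whole[len(whole) // 2:])

# Each block is self-contained, so different workers could each take one.
decoded_last_block = zlib.decompress(blocks[-1])
```

In this sketch the whole-file stream can only be read by a single reader starting from byte 0,
while the block list can be handed out block by block. The cost is a somewhat worse compression
ratio, since each block starts with an empty dictionary.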

> Allow users to compress FileBasedSink output files
> --------------------------------------------------
>
>                 Key: BEAM-55
>                 URL: https://issues.apache.org/jira/browse/BEAM-55
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Daniel Halperin
>            Priority: Minor
>
> FileBasedSink (also TextIO.Write, AvroIO.Write, etc.) does not have an option for compressing
> its output.
> In general, we discourage compression because it limits or blocks scalable parallel reads
> from a file. However, users may want it -- so we should support the option (with appropriate
> warnings).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
