beam-commits mailing list archives

From "Ben Chambers (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-2708) Support for pbzip2 in IO
Date Tue, 01 Aug 2017 20:36:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109703#comment-16109703
] 

Ben Chambers edited comment on BEAM-2708 at 8/1/17 8:35 PM:
------------------------------------------------------------

This looks to be a bug in the CompressedSource support for BZIP2. Specifically, we create
the stream with:

{code:java}
        return Channels.newChannel(
            new BZip2CompressorInputStream(Channels.newInputStream(channel)));
{code}

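This constructor defaults to {{decompressConcatenated = false}}, so only the first "stream"
within the {{bz2}} file is actually read. pbzip2 writes its output as several concatenated
streams, so everything after the first one is silently dropped.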

The fix is easy -- change that code to:

{code:java}
        return Channels.newChannel(
            new BZip2CompressorInputStream(Channels.newInputStream(channel), true));
{code}

But coming up with a test is a bit harder.
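
One possible starting point for such a test, as a rough sketch only: build a multi-stream
input by compressing two pieces of text separately and appending the resulting bytes, then
read it back with both settings of {{decompressConcatenated}}. The class and method names
below ({{ConcatenatedBzip2Sketch}}, {{concatenatedStreams}}, {{decompress}}) are made up for
illustration and are not part of Beam. The sketch assumes Apache Commons Compress is on the
classpath, and that back-to-back bzip2 streams are a fair stand-in for real pbzip2 output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

// Hypothetical helper class, not part of Beam.
public class ConcatenatedBzip2Sketch {

  // Compresses each part as its own complete bzip2 stream and appends the
  // streams back to back (assumed here to approximate what pbzip2 emits).
  static byte[] concatenatedStreams(String... parts) throws IOException {
    ByteArrayOutputStream file = new ByteArrayOutputStream();
    for (String part : parts) {
      ByteArrayOutputStream single = new ByteArrayOutputStream();
      try (OutputStream bz = new BZip2CompressorOutputStream(single)) {
        bz.write(part.getBytes(StandardCharsets.UTF_8));
      }
      file.write(single.toByteArray());
    }
    return file.toByteArray();
  }

  // Decompresses the bytes until the stream reports end of input, using the
  // two-argument constructor so the concatenation behaviour is explicit.
  static String decompress(byte[] data, boolean decompressConcatenated) throws IOException {
    try (InputStream in =
        new BZip2CompressorInputStream(new ByteArrayInputStream(data), decompressConcatenated)) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[4096];
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
      return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] data = concatenatedStreams("first\n", "second\n");
    // With false, only the first stream should come back; with true, both.
    System.out.println(decompress(data, false).equals("first\n"));
    System.out.println(decompress(data, true).equals("first\nsecond\n"));
  }
}
```

A real test would feed the same bytes through {{CompressedSource}} rather than calling the
decompressor directly. The sketch only isolates the behaviour of the flag.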


was (Author: bchambers):
This looks to be a bug in the CompressedSource support for BZIP2. Specifically, we create
the stream with:

{{
        return Channels.newChannel(
            new BZip2CompressorInputStream(Channels.newInputStream(channel)));
}}

Which defaults to {{decompressConcatenated = false}}. As a result only the first "stream"
within the {{bz2}} file is actually read.

The fix is easy -- change that code to:

{{
        return Channels.newChannel(
            new BZip2CompressorInputStream(Channels.newInputStream(channel), true));
}}

But coming up with a test is a bit harder.

> Support for pbzip2 in IO
> ------------------------
>
>                 Key: BEAM-2708
>                 URL: https://issues.apache.org/jira/browse/BEAM-2708
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-extensions, sdk-py
>            Reporter: Pablo Estrada
>            Assignee: Ben Chambers
>
> I'm not sure which components to file this against. A user has observed that pbzip2 files
> are not being properly decompressed:
> https://stackoverflow.com/questions/45439117/google-dataflow-only-partly-uncompressing-files-compressed-with-pbzip2



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
