Date: Mon, 4 Apr 2016 20:21:25 +0000 (UTC)
From: "Luke Cwik (JIRA)"
To: commits@beam.incubator.apache.org
Subject: [jira] [Assigned] (BEAM-167) TextIO can't read concatenated gzip files

     [ https://issues.apache.org/jira/browse/BEAM-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luke Cwik reassigned BEAM-167:
------------------------------

    Assignee: Luke Cwik

> TextIO can't read concatenated gzip files
> -----------------------------------------
>
>                 Key: BEAM-167
>                 URL: https://issues.apache.org/jira/browse/BEAM-167
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-extensions
>            Reporter: Eugene Kirpichov
>            Assignee: Luke Cwik
>
> $ cat <<END > header.csv
> a,b,c
> END
> $ cat <<END > body.csv
> 1,2,3
> 4,5,6
> 7,8,9
> END
> $ gzip -c header.csv > file.gz
> $ gzip -c body.csv >> file.gz
> The file is well-formed:
> $ gzip -dc file.gz
> a,b,c
> 1,2,3
> 4,5,6
> 7,8,9
> However, TextIO.Read.from("/path/to/file.gz") will read only "a,b,c" - reproducible even when the file is on local disk and with the DirectPipelineRunner.
> The bug is in CompressedSource. It uses GzipCompressorInputStream, which by default reads only the first gzip stream in the file, but has an option to read all of them. Previously (in Dataflow SDK 1.4.0) we used GZIPInputStream which reads all streams.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
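
For reference, the behaviour described in the report can be reproduced in memory with only the JDK. This is a minimal sketch, not the CompressedSource code: the class name ConcatGzipDemo and its helper methods are made up for illustration. It builds the same two-member file.gz as the shell commands above and reads it back through java.util.zip.GZIPInputStream, the stream the report says was used in Dataflow SDK 1.4.0. The "option to read all of them" on the Commons Compress side is, as far as I know, the two-argument constructor GzipCompressorInputStream(InputStream, boolean decompressConcatenated); it is not exercised here so that the example stays dependency-free.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of Beam or the Dataflow SDK.
public class ConcatGzipDemo {

    /** Compresses text into one complete gzip member, like `gzip -c`. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Appends one gzip member after another, like `>> file.gz`. */
    static byte[] concat(byte[] first, byte[] second) {
        byte[] joined = Arrays.copyOf(first, first.length + second.length);
        System.arraycopy(second, 0, joined, first.length, second.length);
        return joined;
    }

    /** Decompresses everything GZIPInputStream yields from the given bytes. */
    static String readAll(byte[] gzipped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] header = gzip("a,b,c\n");
        byte[] body = gzip("1,2,3\n4,5,6\n7,8,9\n");
        // Same layout as file.gz in the report: two gzip members back to back.
        System.out.print(readAll(concat(header, body)));
    }
}
```

If the JDK stream behaves as the report states, main prints the header line followed by the three body lines, matching the output of gzip -dc. A reader that stops after the first member would return only "a,b,c".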