Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 92B7C200D5B for ; Wed, 29 Nov 2017 02:25:05 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 91629160C19; Wed, 29 Nov 2017 01:25:05 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id D73CB160BE7 for ; Wed, 29 Nov 2017 02:25:04 +0100 (CET) Received: (qmail 37402 invoked by uid 500); 29 Nov 2017 01:25:04 -0000 Mailing-List: contact commits-help@beam.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@beam.apache.org Delivered-To: mailing list commits@beam.apache.org Received: (qmail 37392 invoked by uid 99); 29 Nov 2017 01:25:04 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 29 Nov 2017 01:25:04 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 33FC51A13D5 for ; Wed, 29 Nov 2017 01:25:03 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.202 X-Spam-Level: X-Spam-Status: No, score=-99.202 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id VTR-p2A_jnGH for ; Wed, 29 Nov 2017 01:25:02 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id 076405F232 for ; Wed, 29 Nov 2017 01:25:02 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 2F052E256A for ; Wed, 29 Nov 2017 01:25:01 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 761F2241CA for ; Wed, 29 Nov 2017 01:25:00 +0000 (UTC) Date: Wed, 29 Nov 2017 01:25:00 +0000 (UTC) From: "Eugene Kirpichov (JIRA)" To: commits@beam.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (BEAM-3268) getPerDestinationOutputFilenames() is getting processed before write is finished on dataflow runner MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Wed, 29 Nov 2017 01:25:05 -0000 [ https://issues.apache.org/jira/browse/BEAM-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269876#comment-16269876 ] Eugene Kirpichov commented on BEAM-3268: ---------------------------------------- Yeah this is a bug, because the transforms that produce perDestinationOutputFilenames produce them before the files are actually copied, e.g. https://github.com/apache/beam/blob/f8d8ff14c49e4dfb15541f4b73aa66513c9a9d23/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L942 One fix is to reorder that code (and the respective code in FinalizeWindowedFn). Another fix is to insert a reshuffle somewhere around https://github.com/apache/beam/blob/f8d8ff14c49e4dfb15541f4b73aa66513c9a9d23/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L816 , which is less brittle - I would prefer the latter. > getPerDestinationOutputFilenames() is getting processed before write is finished on dataflow runner > --------------------------------------------------------------------------------------------------- > > Key: BEAM-3268 > URL: https://issues.apache.org/jira/browse/BEAM-3268 > Project: Beam > Issue Type: Bug > Components: runner-dataflow > Affects Versions: 2.3.0 > Reporter: Kamil Szewczyk > Assignee: Reuven Lax > Attachments: comparison.png > > > While running filebased-io-test we found dataflow-runnner misbehaving. We run tests using single pipeline and without using Reshuffling between writing and reading dataflow jobs are unsuccessful because the runner tries to access the files that were not created yet. > On the picture the difference between execution of writting is presented. On the left there is working example with Reshuffling added and on the right without it. > !comparison.png|thumbnail! > Steps to reproduce: substitute your-bucket-name wit your valid bucket. > {code:java} > mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests -DintegrationTestPipelineOptions='["--runner=dataflow", "--filenamePrefix=gs://your-bucket-name/TEXTIO_IT"]' -Pdataflow-runner > {code} > Then look on the cloud console and job should fail. > Now add Reshuffling to sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java as in the example. > {code:java} > .getPerDestinationOutputFilenames().apply(Values.create()) > .apply(Reshuffle.viaRandomKey()); > PCollection consolidatedHashcode = testFilenames > {code} > and trigger previously used maven command to see it working in the console right now. -- This message was sent by Atlassian JIRA (v6.4.14#64029)