Date: Fri, 22 Apr 2016 02:26:12 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: commits@beam.incubator.apache.org
Reply-To: dev@beam.incubator.apache.org
Subject: [jira] [Commented] (BEAM-50) BigQueryIO.Write: reimplement in Java

    [ https://issues.apache.org/jira/browse/BEAM-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253206#comment-15253206 ]

ASF GitHub Bot commented on BEAM-50:
------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-beam/pull/192

> BigQueryIO.Write: reimplement in Java
> -------------------------------------
>
>                 Key: BEAM-50
>                 URL: https://issues.apache.org/jira/browse/BEAM-50
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-gcp
>            Reporter: Daniel Halperin
>            Assignee: Pei He
>            Priority: Minor
>
> BigQueryIO.Write is currently implemented in a somewhat hacky way.
>
> Unbounded sink:
> * The DirectPipelineRunner and the DataflowPipelineRunner use StreamingWriteFn and BigQueryTableInserter to insert rows using BigQuery's streaming-writes API (a minimal sketch of that API call follows below).
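
The streaming-writes path mentioned in the bullet above ultimately calls BigQuery's tabledata.insertAll method, which is approximately what BigQueryTableInserter wraps. The following is a minimal sketch against the low-level Java client (google-api-services-bigquery), not the SDK's actual code: client construction, batching, and retry logic are omitted, and the class and method names here are our own illustration.

    import java.io.IOException;
    import java.util.Collections;

    import com.google.api.services.bigquery.Bigquery;
    import com.google.api.services.bigquery.model.TableDataInsertAllRequest;
    import com.google.api.services.bigquery.model.TableDataInsertAllResponse;
    import com.google.api.services.bigquery.model.TableRow;

    /** Sketch only: a real implementation (like BigQueryTableInserter) batches and retries. */
    public class StreamingInsertSketch {
      static void insertRow(Bigquery bigquery, String project, String dataset,
          String table, TableRow row, String insertId) throws IOException {
        TableDataInsertAllRequest.Rows entry = new TableDataInsertAllRequest.Rows()
            .setInsertId(insertId)  // lets BigQuery de-duplicate retried inserts
            .setJson(row);          // TableRow behaves as a Map<String, Object>
        TableDataInsertAllRequest body =
            new TableDataInsertAllRequest().setRows(Collections.singletonList(entry));
        TableDataInsertAllResponse response =
            bigquery.tabledata().insertAll(project, dataset, table, body).execute();
        // insertAll reports row-level failures in the response rather than throwing.
        if (response.getInsertErrors() != null && !response.getInsertErrors().isEmpty()) {
          throw new IOException("Row-level insert errors: " + response.getInsertErrors());
        }
      }
    }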
> Bounded sink:
> * The DirectPipelineRunner still uses streaming writes.
> * The DataflowPipelineRunner uses a different code path in the Google Cloud Dataflow service that writes to GCS and then initiates a BigQuery load job.
> * Per-window table destinations do not scale. (See BEAM-XXX.)
>
> We need to reimplement BigQueryIO.Write fully in Java in order to support other runners in a scalable way.
>
> I additionally suggest that we revisit the design of the BigQueryIO sink in the process. A short list:
> * Do not use TableRow as the default value for rows. It could instead be a Map with well-defined types, for example, or an Avro GenericRecord. Dropping TableRow gets around a variety of issues with types, fields named 'f', etc., and it reduces confusion, since we use TableRow objects differently than usual (for good reason).
> * Possibly support not knowing the schema until pipeline execution time.
> * Our builders for BigQueryIO.Write are useful and we should keep them (a sketch of the current builder style appears at the end of this message). Where possible we should also let users provide the JSON objects that configure the underlying table creation, write disposition, etc. This would let users directly control things like table expiration time and table location, and would optimistically let them take advantage of new BigQuery features without code changes.
> * We could choose between the streaming-write API and load jobs based on user preference or dynamic job properties: streaming writes in a batch pipeline if the data is small, or load jobs in a streaming pipeline if the windows are large enough to make that practical.
> * When issuing BigQuery load jobs, we could leave files in GCS if the import fails, so that data errors can be debugged.
> * We should make per-window table writes scalable in batch.
>
> Caveat, possibly a blocker:
> * (BEAM-XXX): cleanup and temp-file management. One advantage of the Google Cloud Dataflow implementation of BigQueryIO.Write is cleanup: we ensure that intermediate files are deleted when bundles or jobs fail. Beam does not currently support this.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
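
For reference, the builder style the issue proposes to keep looks roughly like the sketch below, written against the incubating Beam Java SDK of this era. Package locations shifted during incubation (BigQueryIO later moved under org.apache.beam.sdk.io.gcp.bigquery), and the table spec, schema, and input row here are illustrative assumptions, not part of the issue.

    import java.util.Arrays;

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.TableRowJsonCoder;
    import org.apache.beam.sdk.io.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;

    public class BigQueryWriteSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Schema for a hypothetical two-column table.
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("word").setType("STRING"),
            new TableFieldSchema().setName("count").setType("INTEGER")));

        // A single hard-coded row stands in for a real source.
        p.apply(Create.of(new TableRow().set("word", "hello").set("count", 1))
                .withCoder(TableRowJsonCoder.of()))
         .apply(BigQueryIO.Write
             .to("my-project:my_dataset.my_table")  // hypothetical table spec
             .withSchema(schema)
             .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
             .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }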