beam-commits mailing list archives

From "Uwe Jugel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-1909) BigQuery read transform fails for DirectRunner when querying non-US regions
Date Mon, 22 May 2017 13:11:04 GMT

    [ https://issues.apache.org/jira/browse/BEAM-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019562#comment-16019562 ]

Uwe Jugel commented on BEAM-1909:
---------------------------------

Here are my latest test results regarding this issue:

*Experiments*
# I just tried and failed to query across regions:
{code:sql}
SELECT a.user_id FROM `test_dummy_eu.user_details` a, `test_dummy_us.user_details` b WHERE
a.user_id = b.user_id
-- Error: Cannot process data across locations: EU,US
{code}
# Since we cannot query across regions, I tried to determine the single location/region of
the data source(s) by dry-running the query and then checking the location of the
BigQuery-internal temp table. However, this does not work: the temp table always reports
{{None}}, i.e., US, as its location, even if the source table is in an EU dataset (see the
sketch right after this list).
# However, we can still *transfer the data from the query's own temp table to our temp table
using a {{CopyJob}}*, which works across regions (a condensed sketch follows the note below).
Here is a gist that demonstrates how to do this via the BigQuery Python SDK:
https://gist.github.com/ubunatic/29352bc2c9ddfc33163cfac47bc1e4d6
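
For reference, here is a minimal sketch of the location check from experiment 2, written
against the current google-cloud-bigquery Python client (the dataset and table names are
placeholders; since a dry run materializes no temp table in this client, the sketch inspects
the result of a real query instead):
{code:python}
from google.cloud import bigquery

client = bigquery.Client()

# Run the query so BigQuery materializes its internal anonymous temp table.
query_job = client.query('SELECT user_id FROM `test_dummy_eu.user_details`')
query_job.result()  # wait for the query to finish

# Inspect the temp table holding the query result; per the experiment above,
# its location comes back as None (i.e., US), even though the source table
# lives in an EU dataset.
temp_table = client.get_table(query_job.destination)
print(temp_table.location)
{code}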

*Note*:
I believe a {{CopyJob}} is the appropriate way of copying any table to a temp table,
especially for non-query sources, which we currently read with a {{SELECT *}}. Such a query
may be billed to the user (?), even though the read should be covered by the free data
export quotas (see https://cloud.google.com/bigquery/docs/exporting-data and
https://cloud.google.com/bigquery/pricing#free).
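
A condensed sketch of the {{CopyJob}} approach from the gist, written against the current
google-cloud-bigquery client (the project, dataset, and table names are placeholders, and
the destination dataset must already exist):
{code:python}
from google.cloud import bigquery

client = bigquery.Client(project='my-project')  # placeholder project

# Copy from the query's temp table (stood in for here by the source table)
# into our own temp table via a BigQuery copy job.
copy_job = client.copy_table(
    'my-project.test_dummy_eu.user_details',       # placeholder source
    'my-project.beam_temp_dataset.beam_temp_copy',  # placeholder destination
)
copy_job.result()  # block until the copy finishes
{code}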

*Links:*

||Description||Link||
|CopyJob (py)| https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/bigquery/google/cloud/bigquery/job.py|
|copy job (API)| https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs|
|BQ-read in DataFlow == BQ-export| https://cloud.google.com/bigquery/docs/exporting-data|
|free BQ-export| https://cloud.google.com/bigquery/pricing#free|
|costly (?) "SELECT *" for non-queries| https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py|




> BigQuery read transform fails for DirectRunner when querying non-US regions
> ---------------------------------------------------------------------------
>
>                 Key: BEAM-1909
>                 URL: https://issues.apache.org/jira/browse/BEAM-1909
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>            Reporter: Chamikara Jayalath
>
> See: http://stackoverflow.com/questions/42135002/google-dataflow-cannot-read-and-write-in-different-locations-python-sdk-v0-5-5/42144748?noredirect=1#comment73621983_42144748
> This should be fixed by creating the temp dataset and table in the correct region.
> cc: [~sb2nov]
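
Regarding the fix proposed in the issue description, here is a hedged sketch of creating the
temp dataset in the source table's region with the google-cloud-bigquery client (the dataset
and table names are placeholders):
{code:python}
from google.cloud import bigquery

client = bigquery.Client()

# Look up the region of the source table first ...
source = client.get_table('my-project.test_dummy_eu.user_details')  # placeholder

# ... then create the temp dataset in that same region, so the query's
# destination table ends up co-located with its source.
temp_dataset = bigquery.Dataset('my-project.beam_temp_dataset')  # placeholder
temp_dataset.location = source.location  # e.g. 'EU'
client.create_dataset(temp_dataset)
{code}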



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
