airflow-dev mailing list archives

From Chris Riccomini <criccom...@apache.org>
Subject Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table
Date Wed, 27 Sep 2017 17:41:42 GMT
AFAIK, google-api-python-client is not in maintenance mode. In fact, I
believe the idiomatic Python library (google-cloud-python) is built on top
of google-api-python-client. I have spoken with several Google Cloud PMs
who have pointed me at google-api-python-client as the canonical library to
use, and the one that receives updates for new products first (before
google-cloud-python).

On Wed, Sep 27, 2017 at 10:34 AM, Tobias Feldhaus <
Tobias.Feldhaus@localsearch.ch> wrote:

> Sounds like a possible solution; however, to avoid hitting this problem
> I’ve deleted all the tables before rerunning. I think it might have to do
> with the library. Airflow uses google-api-python-client, which is in
> maintenance mode, and Google suggests switching to google-cloud-python. I
> will write a PythonOperator DAG tomorrow and then compare DAG against DAG
> to see whether the library is the problem.
>
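The PythonOperator callable described above might look like the following
minimal sketch. Only the construction of the per-day destination is shown;
the dataset and table names are placeholders, and the actual
google-cloud-python query call is left as a stub so the sketch stays
self-contained:

```python
from datetime import datetime


def partition_decorator(table: str, execution_date: datetime) -> str:
    """Build the 'table$YYYYMMDD' decorator that targets a single day partition."""
    return "{}${:%Y%m%d}".format(table, execution_date)


def load_day(execution_date: datetime) -> str:
    # In a real PythonOperator callable one would run the query with the
    # BigQuery client here and write to this destination; the client call
    # is omitted because it needs credentials and network access.
    destination = partition_decorator("mydataset.events", execution_date)
    return destination


print(load_day(datetime(2017, 9, 27)))  # mydataset.events$20170927
```

Comparing the rows landed in that partition against the interactive query
result would show whether the client library is the source of the missing
events.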
> On 27.09.2017, 19:15, "Chris Riccomini" <criccomini@apache.org> wrote:
>
>     Is it possible that you were getting a cache hit with the BQ operator?
>
>     https://cloud.google.com/bigquery/docs/cached-results#bigquery-query-cache-api
>
>     The operator does not currently expose this flag, and I couldn't find
>     whether the cache defaults to on or off for the insert-job API.
>
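To rule caching out directly, one could set the `useQueryCache` field in
the `jobs.insert` request body oneself. A minimal sketch of such a job
configuration follows; the project, dataset, and table names are
placeholders, not values from this thread:

```python
# Sketch of a BigQuery jobs.insert request body that disables the query
# cache and writes into one partition via a "$YYYYMMDD" decorator.
# Project/dataset/table names below are illustrative placeholders.
job_body = {
    "configuration": {
        "query": {
            "query": "SELECT * FROM [mydataset.source_table]",
            "useQueryCache": False,  # bypass cached results entirely
            "writeDisposition": "WRITE_TRUNCATE",
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "mydataset",
                # "$20170927" targets the single day partition
                "tableId": "events$20170927",
            },
        }
    }
}

# With google-api-python-client this body would be passed to
# service.jobs().insert(projectId=..., body=job_body); here we just
# inspect the flag.
print(job_body["configuration"]["query"]["useQueryCache"])
```

If the missing rows reappear with the cache disabled, that would point at
cached results rather than the partitioned-table write path.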
>     On Wed, Sep 27, 2017 at 9:41 AM, Tobias Feldhaus <
>     Tobias.Feldhaus@localsearch.ch> wrote:
>
>     > I’ve created a table with only the missing value in the exact same
>     > partition, and then it’s going through. Could it be that the volume
>     > of the data plays a role, or maybe the client libraries?
>     >
>     > On 27.09.2017, 17:46, "Tobias Feldhaus" <Tobias.Feldhaus@localsearch.ch> wrote:
>     >
>     >     Hi,
>     >
>     >
>     >     I am tracing a bug in one of our data pipelines, and I have
>     >     narrowed it down to a small number of events missing from a
>     >     table (using Airflow 1.8.2). After interactively running the
>     >     query that Airflow executed, I saw the missing entry. But when
>     >     Airflow executed the same query and wrote the results to a
>     >     partitioned table in BQ, the entry was missing from that
>     >     destination table.
>     >
>     >     I’ve tried different scenarios several times now, and the only
>     >     explanation I can come up with is that using partitioned tables
>     >     _might_ not be fully supported, or that there is some weird bug
>     >     in the bigquery-python implementation.
>     >
>     >     When I delete the table, recreate it, and reload the complete
>     >     data with Airflow, the data is still missing. When reloading a
>     >     single day, it is also missing. I’ve created a Python script to
>     >     execute the exact same query, and it works as expected.
>     >
>     >     Any advice on how to track this down further? Is this a known
>     >     issue?
>     >
>     >     Best,
>     >     Tobias
>     >
>     >
>     >
>     >
>     >
>
>
>
