airflow-dev mailing list archives

From Tobias Feldhaus <Tobias.Feldh...@localsearch.ch>
Subject Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table
Date Wed, 27 Sep 2017 17:34:22 GMT
Sounds like a possible solution; however, to rule that out I had already deleted all the
tables before rerunning. I think it might have to do with the client library: Airflow uses
google-api-python-client, which is in maintenance mode, and Google suggests switching to google-cloud-python.
I will write a PythonOperator DAG tomorrow and compare it against the BigQueryOperator DAG to see
whether the library is the problem.
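[Editor's note: one way to rule the cache out, independent of which client library is used, is to pin useQueryCache explicitly in the insert-job request body instead of relying on the default. The sketch below builds such a body in plain Python; the project/dataset/table names and the helper function are invented for illustration, while the field names follow the BigQuery jobs.insert REST API.]

```python
import json

def build_query_job_body(sql, project, dataset, table, partition=None):
    """Build a BigQuery jobs.insert request body that writes query results
    to a destination table, optionally into a single day's partition,
    with the query cache explicitly disabled.

    All names passed in here are hypothetical; the structure mirrors the
    `configuration.query` section of the jobs.insert REST API.
    """
    # A "$YYYYMMDD" suffix on the table ID is BigQuery's partition
    # decorator for addressing one partition of a day-partitioned table.
    table_id = table if partition is None else "%s$%s" % (table, partition)
    return {
        "configuration": {
            "query": {
                "query": sql,
                "useLegacySql": False,
                # Force a fresh execution instead of serving cached results.
                "useQueryCache": False,
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table_id,
                },
                "writeDisposition": "WRITE_TRUNCATE",
            }
        }
    }

# Hypothetical example: write one day's results into the 2017-09-27 partition.
body = build_query_job_body(
    "SELECT * FROM my_dataset.events_raw",
    "my-project", "my_dataset", "events", partition="20170927")
print(json.dumps(body, indent=2))
```

With the raw google-api-python-client, a body like this would be passed to the discovery-based `service.jobs().insert(projectId=..., body=...)` call; comparing runs with the flag pinned on and off should show whether the cache is involved at all.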

On 27.09.2017, 19:15, "Chris Riccomini" <criccomini@apache.org> wrote:

    Is it possible that you were getting a cache hit with the BQ operator?
    
    https://cloud.google.com/bigquery/docs/cached-results#bigquery-query-cache-api
    
    The operator does not currently expose this flag, and I couldn't find
    whether the cache defaults to on or off for the insert-job API.
    
    On Wed, Sep 27, 2017 at 9:41 AM, Tobias Feldhaus <
    Tobias.Feldhaus@localsearch.ch> wrote:
    
    > I’ve created a table with only the missing row in the exact same
    > partition, and then it goes through. Could it be that the volume of the
    > data plays a role, or maybe the client libraries?
    >
    > On 27.09.2017, 17:46, "Tobias Feldhaus" <Tobias.Feldhaus@localsearch.ch>
    > wrote:
    >
    >     Hi,
    >
    >
    >     I am tracing a bug in one of our data pipelines (using Airflow
    > 1.8.2) and have narrowed it down to a small number of events missing
    > from a table.
    >     When I ran the query that Airflow executes myself, interactively,
    > the missing entry showed up. But when Airflow executed the same query
    > and wrote the results to a partitioned table in BQ, the entry was
    > missing from the destination table.
    >     I’ve tried different scenarios several times now, and the only
    > explanation I can come up with is that writing to partitioned tables
    > might not be fully supported, or that there is some weird bug in the
    > bigquery-python implementation.
    >
    >     When I delete and recreate the table and reload the complete date
    > range with Airflow, the data is still missing. When I reload a single
    > day, it is also missing. I’ve written a Python script that executes
    > the exact same query, and it works as expected.
    >
    >     Any advice on how to track this down further? Is this a known issue?
    >
    >     Best,
    >     Tobias
    >
    >
    >
    >
    >
    
