airflow-dev mailing list archives

From Tobias Feldhaus <>
Subject Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table
Date Thu, 28 Sep 2017 10:07:47 GMT
I think I found the issue. I reran everything and found that the respective
date was now there, but another date was missing. After some investigation I
stumbled upon this:

Airflow simply didn't process some days of the month (August) that I was
reprocessing. Yesterday it skipped August 24th, and now August 17th and 18th
are missing!

[Screenshot of the Airflow interface showing the run for 2017-08-16; the runs
for the 17th and 18th are missing and the 19th is the next one]

[Screenshot of the Airflow interface showing the run for 2017-08-19]

What could be the reason for this? Did the clearing command via the
web interface maybe fail? Why are the days no longer shown in the web
interface at all?
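
To list the skipped days without eyeballing the UI, the execution dates that do have a run can be compared against the full date range. This is only a rough sketch: the `august` list is hypothetical input standing in for whatever dates the web interface or the metadata database reports, and `missing_days` is a helper name made up for this example.

```python
from datetime import date, timedelta


def missing_days(start, end, processed):
    """Return every day in [start, end] that is absent from `processed`."""
    seen = set(processed)
    total_days = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(total_days))
    return [d for d in expected if d not in seen]


# Hypothetical input: every August day except the two that were skipped.
august = [date(2017, 8, d) for d in range(1, 32) if d not in (17, 18)]
gaps = missing_days(date(2017, 8, 1), date(2017, 8, 31), august)
print(gaps)
```

With the input above this reports the 17th and 18th, matching what the screenshots show.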

On 27.09.2017, 23:20, "Tobias Feldhaus" <> wrote:

    I am also skeptical, but I want to be sure. The next thing I would do is step through
with a debugger to see if the query gets altered in any way before it’s sent out. Is it
possible to step through with pdb when triggering via “airflow run”?
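
As a lighter-weight check than a debugger session, the query text captured from both code paths could be compared by fingerprint. This is a sketch under assumptions: `query_fingerprint` and `same_query` are hypothetical helpers, and how the two query strings are captured (logging, a print in the operator, etc.) is left open. Collapsing whitespace also touches whitespace inside string literals, so a mismatch is a reliable signal but a match is only a strong hint.

```python
import hashlib


def query_fingerprint(sql):
    """Hash a query after collapsing whitespace, so layout changes alone do not count."""
    normalized = " ".join(sql.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def same_query(sql_a, sql_b):
    """True if both query strings are identical apart from whitespace layout."""
    return query_fingerprint(sql_a) == query_fingerprint(sql_b)


# Placeholder queries; in practice these would be the strings captured from
# the interactive run and from the Airflow task.
interactive = "SELECT *\n  FROM events\n WHERE day = '2017-08-17'"
from_airflow = "SELECT * FROM events WHERE day = '2017-08-17'"
print(same_query(interactive, from_airflow))
```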
    On 27.09.2017, 22:56, "Chris Riccomini" <> wrote:
    I am highly skeptical that it's the library.
    On Wed, Sep 27, 2017 at 1:50 PM, Tobias Feldhaus <> wrote:
    This was exactly my point. Before I dig deeper I want to build a minimal PythonOperator
that uses the new library, as I am currently comparing apples with oranges (same query,
same data, different libraries). Although it really puzzles me how a different library
can yield different (read as: some is missing) data when its job is just to execute a
query, not to pull and transform the data.
    On 27.09.2017, 19:43, "Chris Riccomini" <> wrote:
        Interesting. Just saw:
        > This client library is supported but in maintenance mode only. We are
        fixing necessary bugs and adding essential features to ensure this library
        continues to meet your needs for accessing Google APIs. Non-critical issues
        will be closed. Any issue may be reopened if it is causing ongoing problems.
        Looks like we might want to migrate at some point. It'll be a big change.
        On Wed, Sep 27, 2017 at 10:41 AM, Chris Riccomini <> wrote:
        > AFAIK, google-api-python-client is not in maintenance mode. In fact, I
        > believe the idiomatic Python library (google-cloud-python) is built on
        > top of google-api-python-client. I have spoken with several Google cloud
        > PMs who have pointed me at google-api-python-client as the canonical
        > library to use, and the one that receives updates for new products
        > first (before google-cloud-python).
        > On Wed, Sep 27, 2017 at 10:34 AM, Tobias Feldhaus <
        >> Sounds like a possible solution, however to avoid hitting this problem
        >> I’ve deleted all the tables before rerunning stuff. I think it might have
        >> to do with the library. Airflow uses google-api-python-client which is in
        >> maintenance mode and Google suggests switching to google-cloud-python. I
        >> will write a PythonOperator DAG tomorrow and will check DAG against DAG
        >> then to see if the library could be the problem.
        >> On 27.09.2017, 19:15, "Chris Riccomini" <<>>
        >>     Is it possible that you were getting a cache hit with the BQ operator?
        >> ry-query-cache-api
        >>     The operator does not currently expose this flag, and I couldn't find
        >>     whether the cache defaults to on or off for insert-job API.
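
For reference, the flag in question is the `useQueryCache` field of the query configuration in the jobs.insert request body. Below is a minimal sketch of a body that sets it explicitly and writes to a single day's partition via the `table$YYYYMMDD` decorator. `build_query_job` and all project, dataset and table names are hypothetical, and this is not how the operator currently builds its request.

```python
def build_query_job(sql, project, dataset, table, day):
    """Build a jobs.insert request body that writes one day's partition.

    `day` is a YYYYMMDD string; `table$YYYYMMDD` is BigQuery's partition
    decorator for addressing a single partition of a day-partitioned table.
    """
    return {
        "configuration": {
            "query": {
                "query": sql,
                # Explicitly disable cached results for this job.
                "useQueryCache": False,
                # Replace the contents of the addressed partition.
                "writeDisposition": "WRITE_TRUNCATE",
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": "{}${}".format(table, day),
                },
            }
        }
    }


# Placeholder names, for illustration only.
body = build_query_job("SELECT 1", "my-project", "my_dataset", "events", "20170817")
print(body["configuration"]["query"]["destinationTable"]["tableId"])
```

Setting the flag explicitly removes the guesswork about the default when comparing the operator's behaviour against an interactive run.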
        >>     On Wed, Sep 27, 2017 at 9:41 AM, Tobias Feldhaus <
        >>     > I’ve created a table with only the missing value in the exact
        >>     > partition, and then it’s going through. Could it be that the volume
        >>     > of the data plays a role, or the client libraries maybe?
        >>     >
        >>     > On 27.09.2017, 17:46, "Tobias Feldhaus" <
        >>     > wrote:
        >>     >
        >>     >     Hi,
        >>     >
        >>     >
        >>     >     I am tracing a bug in one of our data pipelines and I narrowed it
        >>     > down to a small number of events not being in a table (using Airflow
        >>     > 1.8.2).
        >>     >     After interactively running the query that Airflow executed, I saw
        >>     > the missing entry. When Airflow executed the same query and wrote the
        >>     > results to a partitioned table in BQ, the entry was missing in that
        >>     > destination table.
        >>     >     I’ve tried different scenarios several times now, and the only
        >>     > explanation I can come up with is that partitioned tables _might_ not
        >>     > be fully supported by Airflow, or that there is some weird bug in the
        >>     > bigquery-python implementation.
        >>     >
        >>     >     When deleting the table, recreating it and reloading the complete
        >>     > date with Airflow, the data is still missing. When reloading a single
        >>     > day, it is also missing. I’ve created a Python script to execute the
        >>     > exact same query and it works as expected.
        >>     >
        >>     >     Any advice on how to track this down further? Is this a known
        >>     > issue?
        >>     >
        >>     >     Best,
        >>     >     Tobias
        >>     >
