airflow-dev mailing list archives

From Tobias Feldhaus <Tobias.Feldh...@localsearch.ch>
Subject Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table
Date Wed, 27 Sep 2017 20:50:46 GMT
This was exactly my point. Before I dig deeper, I want to build a very minimal PythonOperator
that uses the new library, as I am currently
comparing apples with oranges (same query, same data, different libraries). Although it really
puzzles me how a different library can yield different (read: some of it is missing) data,
when its job is just to execute a query, not to pull and transform it.
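For the planned DAG-vs-DAG comparison, one detail both runs must share is the destination: BigQuery addresses a single day of a partitioned table with a `$YYYYMMDD` decorator on the table name. A minimal helper sketch (the dataset and table names are illustrative, not from this thread):

```python
from datetime import date

def partitioned_destination(dataset_table, day):
    """Append BigQuery's partition decorator so a query writes into a
    single day partition, e.g. 'mydataset.events$20170927'."""
    return "{}${}".format(dataset_table, day.strftime("%Y%m%d"))

# Both the BigQueryOperator run and the hand-written PythonOperator run
# would target the same decorated destination for the execution date.
print(partitioned_destination("mydataset.events", date(2017, 9, 27)))
# → mydataset.events$20170927
```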


On 27.09.2017, 19:43, "Chris Riccomini" <criccomini@apache.org> wrote:

    Interesting. Just saw:
    
    https://github.com/google/google-api-python-client
    
    > This client library is supported but in maintenance mode only. We are
    fixing necessary bugs and adding essential features to ensure this library
    continues to meet your needs for accessing Google APIs. Non-critical issues
    will be closed. Any issue may be reopened if it is causing ongoing problems.
    
    Looks like we might want to migrate at some point. It'll be a big change.
    <https://github.com/google/google-api-python-client#about>
    
    On Wed, Sep 27, 2017 at 10:41 AM, Chris Riccomini <criccomini@apache.org>
    wrote:
    
    > AFAIK, google-api-python-client is not in maintenance mode. In fact, I
    > believe the idiomatic Python library (google-cloud-python) is built on
    > top of google-api-python-client. I have spoken with several Google Cloud
    > PMs who have pointed me at google-api-python-client as the canonical
    > library to use, and the one that receives updates for new products first
    > (before google-cloud-python).
    >
    > On Wed, Sep 27, 2017 at 10:34 AM, Tobias Feldhaus <
    > Tobias.Feldhaus@localsearch.ch> wrote:
    >
    >> Sounds like a possible explanation; however, to avoid hitting this
    >> problem I had already deleted all the tables before rerunning. I think
    >> it might have to do with the library: Airflow uses
    >> google-api-python-client, which is in maintenance mode, and Google
    >> suggests switching to google-cloud-python. I will write a PythonOperator
    >> DAG tomorrow and then check DAG against DAG to see if the library could
    >> be the problem.
    >>
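A sketch of what that PythonOperator callable could look like with the newer google-cloud-bigquery client. This follows the modern API shape (the 2017-era client differed), and the comparison helper is a hypothetical addition for the DAG-against-DAG check, not something from the thread:

```python
def missing_rows(reference_rows, candidate_rows):
    """Keys present in the reference result set but absent from the
    candidate one -- the DAG-against-DAG check described above."""
    return sorted(set(reference_rows) - set(candidate_rows))

def run_query_with_new_client(sql, destination_table):
    """Callable for a PythonOperator. The import is deferred so this
    module loads even where google-cloud-bigquery is not installed."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(destination=destination_table)
    job = client.query(sql, job_config=job_config)
    return [tuple(row) for row in job.result()]  # blocks until the job finishes

# Example of the comparison step on two small result sets:
print(missing_rows([("a",), ("b",), ("c",)], [("a",), ("c",)]))
# → [('b',)]
```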
    >> On 27.09.2017, 19:15, "Chris Riccomini" <criccomini@apache.org> wrote:
    >>
    >>     Is it possible that you were getting a cache hit with the BQ operator?
    >>
    >>     https://cloud.google.com/bigquery/docs/cached-results#bigquery-query-cache-api
    >>
    >>     The operator does not currently expose this flag, and I couldn't find
    >>     whether the cache defaults to on or off for the insert-job API.
    >>
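For reference, the cache switch lives in the jobs.insert request body that google-api-python-client callers build by hand; the BigQuery REST docs state that `useQueryCache` defaults to true when omitted. A minimal sketch (the field names come from the API, but the helper itself is hypothetical):

```python
def query_job_body(sql, destination, use_query_cache):
    """Build a BigQuery jobs.insert request body. When useQueryCache is
    omitted entirely, the service defaults it to true, so an operator
    that never sets the field runs with caching enabled."""
    return {
        "configuration": {
            "query": {
                "query": sql,
                "useQueryCache": use_query_cache,
                "destinationTable": destination,  # projectId/datasetId/tableId
            }
        }
    }

body = query_job_body(
    "SELECT 1",
    {"projectId": "p", "datasetId": "d", "tableId": "t"},
    use_query_cache=False,  # force a fresh run while debugging missing rows
)
print(body["configuration"]["query"]["useQueryCache"])
# → False
```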
    >>     On Wed, Sep 27, 2017 at 9:41 AM, Tobias Feldhaus <
    >>     Tobias.Feldhaus@localsearch.ch> wrote:
    >>
    >>     > I’ve created a table with only the missing value in the exact same
    >>     > partition, and then it goes through. Could it be that the volume of
    >>     > the data plays a role, or maybe the client libraries?
    >>     >
    >>     > On 27.09.2017, 17:46, "Tobias Feldhaus" <
    >> Tobias.Feldhaus@localsearch.ch>
    >>     > wrote:
    >>     >
    >>     >     Hi,
    >>     >
    >>     >
    >>     >     I am tracing a bug in one of our data pipelines, and I narrowed
    >>     > it down to a small number of events missing from a table (using
    >>     > Airflow 1.8.2). After interactively running the query that Airflow
    >>     > executed, I saw the missing entry. But when Airflow executed the
    >>     > same query and wrote the results to a partitioned table in BQ, the
    >>     > entry was missing from that destination table.
    >>     >     I’ve tried different scenarios now several times, and the only
    >>     > explanation or difference I can come up with is that using
    >>     > partitioned tables _might_ not be fully supported, or that there is
    >>     > some weird bug in the bigquery-python implementation.
    >>     >
    >>     >     When I delete the table, recreate it, and reload the complete
    >>     > date range with Airflow, the data is still missing. When reloading
    >>     > a single day, it is also missing. I’ve created a Python script to
    >>     > execute the exact same query, and it works as expected.
    >>     >
    >>     >     Any advice on how to track this down further? Is this a known
    >>     > issue?
    >>     >
    >>     >     Best,
    >>     >     Tobias
    >>     >
    >>     >
    >>     >
    >>     >
    >>     >
    >>
    >>
    >>
    >
    
