airflow-dev mailing list archives

From Chris Riccomini <criccom...@apache.org>
Subject Re: Possible Bug (?) in BigQueryOperator - Missing data when writing to a partitioned table
Date Wed, 27 Sep 2017 20:56:22 GMT
I am highly skeptical that it's the library.
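
One way to check this without switching libraries is to pull both result sets and diff them. Below is a minimal sketch in plain Python, assuming the rows from the interactive query and from the partition Airflow wrote have already been fetched as lists of tuples. The function name `diff_rows` and the sample values are hypothetical and not taken from this thread.

```python
from collections import Counter


def diff_rows(expected_rows, actual_rows):
    """Return (missing, unexpected) rows as Counters, respecting duplicates.

    Rows are assumed to be sequences of hashable column values in the same
    column order; each row is converted to a tuple so it can be counted.
    """
    expected = Counter(tuple(row) for row in expected_rows)
    actual = Counter(tuple(row) for row in actual_rows)
    missing = expected - actual      # in the reference result, not in the table
    unexpected = actual - expected   # in the table, not in the reference result
    return missing, unexpected


if __name__ == "__main__":
    # Hypothetical sample data standing in for the two query results.
    interactive = [("2017-09-26", "evt-1"), ("2017-09-26", "evt-2")]
    from_airflow = [("2017-09-26", "evt-1")]
    missing, unexpected = diff_rows(interactive, from_airflow)
    print("missing:", sorted(missing.elements()))
    print("unexpected:", sorted(unexpected.elements()))
```

Counting rows rather than using sets means a row that appears twice in one result and once in the other is still reported, which matters if the events are not guaranteed to be unique.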

On Wed, Sep 27, 2017 at 1:50 PM, Tobias Feldhaus <
Tobias.Feldhaus@localsearch.ch> wrote:

> This was exactly my point. Before I dig deeper I want to build a minimal
> PythonOperator that uses the new library, as I am currently comparing
> apples with oranges (same query, same data, different libraries). It
> really puzzles me, though, how a different library can yield different
> data (read: some rows are missing) when its job is just to execute a
> query, not to pull and transform the results.
>
>
> On 27.09.2017, 19:43, "Chris Riccomini" <criccomini@apache.org> wrote:
>
>     Interesting. Just saw:
>
>     https://github.com/google/google-api-python-client
>
>     > This client library is supported but in maintenance mode only. We are
>     > fixing necessary bugs and adding essential features to ensure this
>     > library continues to meet your needs for accessing Google APIs.
>     > Non-critical issues will be closed. Any issue may be reopened if it is
>     > causing ongoing problems.
>
>     Looks like we might want to migrate at some point. It'll be a big change.
>     <https://github.com/google/google-api-python-client#about>
>
>     On Wed, Sep 27, 2017 at 10:41 AM, Chris Riccomini
>     <criccomini@apache.org> wrote:
>
>     > AFAIK, google-api-python-client is not in maintenance mode. In fact, I
>     > believe the idiomatic Python library (google-cloud-python) is built on
>     > top of google-api-python-client. I have spoken with several Google
>     > Cloud PMs who have pointed me at google-api-python-client as the
>     > canonical library to use, and the one that receives updates for new
>     > products first (before google-cloud-python).
>     >
>     > On Wed, Sep 27, 2017 at 10:34 AM, Tobias Feldhaus <
>     > Tobias.Feldhaus@localsearch.ch> wrote:
>     >
>     >> Sounds like a possible solution. However, to avoid hitting this problem
>     >> I’ve deleted all the tables before rerunning everything. I think it
>     >> might have to do with the library: Airflow uses
>     >> google-api-python-client, which is in maintenance mode, and Google
>     >> suggests switching to google-cloud-python. Tomorrow I will write a
>     >> PythonOperator DAG and compare it DAG against DAG to see whether the
>     >> library could be the problem.
>     >>
>     >> On 27.09.2017, 19:15, "Chris Riccomini" <criccomini@apache.org> wrote:
>     >>
>     >>     Is it possible that you were getting a cache hit with the BQ operator?
>     >>
>     >>     https://cloud.google.com/bigquery/docs/cached-results#bigquery-query-cache-api
>     >>
>     >>     The operator does not currently expose this flag, and I couldn't find
>     >>     whether the cache defaults to on or off for the insert-job API.
>     >>
>     >>     On Wed, Sep 27, 2017 at 9:41 AM, Tobias Feldhaus <
>     >>     Tobias.Feldhaus@localsearch.ch> wrote:
>     >>
>     >>     > I’ve created a table with only the missing value in the exact same
>     >>     > partition, and then it goes through. Could it be that the volume of
>     >>     > the data plays a role, or maybe the client libraries?
>     >>     >
>     >>     > On 27.09.2017, 17:46, "Tobias Feldhaus"
>     >>     > <Tobias.Feldhaus@localsearch.ch> wrote:
>     >>     >
>     >>     >     Hi,
>     >>     >
>     >>     >
>     >>     >     I am tracing a bug in one of our data pipelines and have
>     >>     >     narrowed it down to a small number of events not being in a
>     >>     >     table (using Airflow 1.8.2). When I interactively ran the same
>     >>     >     query that Airflow executed, I saw the missing entry. When
>     >>     >     Airflow executed that query and wrote the results to a
>     >>     >     partitioned table in BQ, the entry was missing from that
>     >>     >     destination table.
>     >>     >     I’ve now tried different scenarios several times, and the only
>     >>     >     explanation I can come up with is that Airflow _might_ not
>     >>     >     fully support partitioned tables, or that there is some weird
>     >>     >     bug in the bigquery-python implementation.
>     >>     >
>     >>     >     When I delete the table, recreate it and reload it completely
>     >>     >     with Airflow, the data is still missing. When I reload a single
>     >>     >     day, it is also missing. I’ve created a Python script that
>     >>     >     executes the exact same query, and it works as expected.
>     >>     >
>     >>     >     Any advice on how to track this down further? Is this a known
>     >>     >     issue?
>     >>     >
>     >>     >     Best,
>     >>     >     Tobias
>     >>     >
>     >>     >
>     >>     >
>     >>     >
>     >>     >
>     >>
>     >>
>     >>
>     >
>
>
>
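
On the cache question raised earlier in the thread: the BigQuery REST API's query job configuration has a `useQueryCache` field, so a test job could set it explicitly instead of relying on the default. The sketch below only builds the request body as a plain dict and makes no API call. The helper name `build_query_job_body` and all argument values are hypothetical; the field names (`useQueryCache`, `destinationTable`, `writeDisposition`) and the `table$YYYYMMDD` partition decorator are BigQuery's. The BigQuery documentation on cached results also appears to list a destination table among the cases where the cache is not used, so this may turn out to be moot for writes into a partitioned table.

```python
def build_query_job_body(project_id, dataset_id, table_id, partition, query,
                         use_query_cache=False):
    """Build a jobs.insert request body that writes to one table partition.

    `partition` is a YYYYMMDD string; a single partition of a day-partitioned
    table is addressed with the `table$YYYYMMDD` decorator.
    """
    if len(partition) != 8 or not partition.isdigit():
        raise ValueError("partition must be a YYYYMMDD string")
    return {
        "configuration": {
            "query": {
                "query": query,
                # Explicitly opt out of cached results for this test run.
                "useQueryCache": use_query_cache,
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": "{}${}".format(table_id, partition),
                },
                # Replace the partition's contents instead of appending.
                "writeDisposition": "WRITE_TRUNCATE",
            }
        }
    }
```

With google-api-python-client, a body like this would typically be passed as `service.jobs().insert(projectId=..., body=body).execute()`. Treat that as an outline rather than a tested call against a real project.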
