beam-commits mailing list archives

From "Ahmet Altay (JIRA)" <>
Subject [jira] [Commented] (BEAM-1787) Python DirectRunner silently blocks reading full query from Google Datastore
Date Thu, 23 Mar 2017 06:14:41 GMT


Ahmet Altay commented on BEAM-1787:

Mike, is it possible that there is a GroupByKey in step 3 of the reproduction? DirectRunner
processes bundles and waits for all upstream data to be available before running a GroupByKey,
so that could be what you are noticing.
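
To illustrate the point about GroupByKey, here is a minimal sketch in plain Python (not Beam code, and it does not model bundles). The function names and sample data are made up for illustration. It only shows why an element-wise step can start as soon as the first result arrives, while a grouping step cannot emit anything until all upstream input has been read.

```python
from collections import defaultdict


def read_splits(splits, log):
    """Hypothetical stand-in for step 2: yield (key, value) pairs, split by split."""
    for split in splits:
        for pair in split:
            log.append(("read", pair))
            yield pair


def element_wise(pairs, log):
    """An element-wise step can handle each pair as soon as it arrives."""
    for pair in pairs:
        log.append(("process", pair))
        yield pair


def group_by_key(pairs):
    """A grouping step must consume every input pair before any group is complete."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


splits = [[("a", 1), ("b", 2)], [("a", 3)]]

# Element-wise: reads and processing interleave in the log.
streamed_log = []
list(element_wise(read_splits(splits, streamed_log), streamed_log))

# Grouping: every read is logged before the grouped result exists.
grouped_log = []
grouped = group_by_key(read_splits(splits, grouped_log))
```

In the first log each "read" entry is followed immediately by its "process" entry. In the second, all reads finish before `grouped` is returned, which from the outside looks like a hang if nothing is printed in the meantime.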

> Python DirectRunner silently blocks reading full query from Google Datastore
> ----------------------------------------------------------------------------
>                 Key: BEAM-1787
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>            Reporter: Mike Lambert
>            Assignee: Ahmet Altay
>            Priority: Minor
>              Labels: datastore, python
> When I run a query (even with many splits) against the production datastore (such as in the
> datastore_wordcount demo), it operates as follows:
> 1. split the query into a bunch of split queries
> 2. run each split query, collecting the results
> 3. then pass the results to the following stage / ParDo
> However, step 2 is run to completion with DirectRunner before step 3 starts. So a large
> dataset must be fully downloaded before it attempts to run any of the following stages.
> While this may make sense, and local parallelism/pipelining might be impossible, there are
> no output or status messages. Debugging why my code appeared to hang before processing
> results took forever: I had to dig through the code and add logging throughout the Beam
> code to figure out what was going on.
> See for more details
> This happens with github head 0.7.0-dev (there was no "version" tag for this above).
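
The missing status messages described above could, in code that iterates over results itself, be addressed with a small wrapper that logs progress. The sketch below is a generic illustration in plain Python; `log_progress` and its parameters are hypothetical names, not part of Beam or the Datastore connector.

```python
import logging


def log_progress(items, every=1000, logger=None, label="items"):
    """Yield items unchanged, logging a status line every `every` items."""
    logger = logger or logging.getLogger(__name__)
    count = 0
    for count, item in enumerate(items, start=1):
        if count % every == 0:
            logger.info("read %d %s so far", count, label)
        yield item
    # Final line so the reader knows the read finished rather than stalled.
    logger.info("finished reading %d %s", count, label)


logging.basicConfig(level=logging.INFO)
results = list(log_progress(range(5), every=2, label="entities"))
```

The wrapper does not change what is read or when; it only makes a long read visible while it is in progress.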
