beam-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eugene Kirpichov <kirpic...@google.com>
Subject Re: Why beam.io.BigQuerySource() transform taking so long time to read a bigquery table to P-Collection
Date Wed, 17 Jan 2018 22:54:09 GMT
I'm looking at this job and it appears to apply hundreds of transforms
after reading from BigQuery inside the BQRead step, and that's what's
taking the time. If you suspect that the problem is still in pure reading
from BigQuery, can you simplify the pipeline so that it only uses
beam.io.BigQuerySource() and nothing else, so we can look at just the
performance of that?

On Tue, Jan 16, 2018 at 10:56 PM Unais Thachuparambil <
unais.thachuparambil@careem.com> wrote:

> This is my dataflow Job id  2018-01-15_03_45_25-16237456828769339983
>
> On Wed, Jan 17, 2018 at 12:10 AM, Eugene Kirpichov <kirpichov@google.com>
> wrote:
>
>> After the bigquery export job, Beam needs to also ingest the results of
>> the export (Avro files); Beam uses the standard Python avro library for
>> that, which has notoriously awful performance; taking 2 hours to read 44GB
>> of data would not be surprising. It's possible that this is the root of the
>> issue, but we'd need a job id to tell for sure. There are several forks of
>> the avro library with better performance but the forks are unmaintained.
>>
>> On Tue, Jan 16, 2018 at 11:53 AM Chamikara Jayalath <chamikara@google.com>
>> wrote:
>>
>>> BQ Read step involves running a query job in BQ followed by running a
>>> export job to export the resulting table to GCS. Is it possible that these
>>> jobs took a long time for some reason ?
>>>
>>> Dataflow job should log the BQ job IDs of these jobs and it should be
>>> possible to check the status using following command.
>>>
>>> bq show -j --project_id=<GCP project ID> <BQ job ID>
>>>
>>> Feel free to mention your job ID in Dataflow SDK's stackoverflow channel
>>> if you want Dataflow team to take a look.
>>> https://stackoverflow.com/questions/tagged/dataflow
>>>
>>> - Cham
>>>
>>>
>>>
>>> On Tue, Jan 16, 2018 at 1:15 AM Unais Thachuparambil <
>>> unais.thachuparambil@careem.com> wrote:
>>>
>>>>
>>>> I'm reading a date sharded table from Bigquery (180 days ~ 44.26GB)
>>>> using beam.io.BigQuerySource() by running a simple query
>>>>
>>>>  """
>>>> SELECT
>>>>   filed1,
>>>>   filed2,
>>>>   filed3,
>>>>   filed4,
>>>>   filed5,
>>>>   filed6
>>>> FROM
>>>>   TABLE_DATE_RANGE([dataset:table_name_], TIMESTAMP('{start_date}'),
>>>> TIMESTAMP('{end_date}'))
>>>> WHERE
>>>>   filed1 IS NOT NULL
>>>> """
>>>>
>>>> after that, I'm partitioning the source data based on field2 date and
>>>> converting to date partitioned P-Collections
>>>>
>>>> But while monitoring the data flow console I noticed that the BQRead
>>>> operation taking more than 1hr 40min out of 2hr: 54-minute total execution.
>>>>
>>>> Why the BQ io read taking a long time? Is there any implemented method
>>>> in data flow (I'm using python API) to speed up this process.
>>>>
>>>> How I can reduce the read io execution time?.
>>>>
>>>> Screenshot of graph is attached (Time showed on the graph is wrong - It
>>>> took 2hr 54-min to finish)
>>>>
>>>> [image: Screen Shot 2018-01-16 at 1.06.28 PM.png]
>>>> ‚Äč
>>>>
>>>
>

Mime
View raw message