beam-user mailing list archives

From Eugene Kirpichov <kirpic...@google.com>
Subject Re: Scaling dataflow python SDK 2.0.0
Date Mon, 05 Jun 2017 19:42:20 GMT
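Do you have a Dataflow job ID we can look at?
It might be due to fusion: if Dataflow fuses your steps into a single stage,
the fan-out from 200 records to 20,000,000 records cannot be redistributed
across workers. See
https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion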
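
A minimal sketch of breaking fusion after a high fan-out step, assuming a
Python SDK version that provides `beam.Reshuffle` (it may not be available in
2.0.0). `expand_archive`, `build_pipeline`, and the step labels are
hypothetical names, and the expansion is a simplified stand-in for the real
join logic:

```python
import csv
import io


def expand_archive(archive):
    """Fan one archive record out into one (file name, row) pair per CSV line.

    `archive` is a hypothetical dict mapping file name to CSV text, in the
    shape {file1: content_of_file1, file2: content_of_file2}.
    """
    for name, content in sorted(archive.items()):
        for row in csv.reader(io.StringIO(content)):
            yield name, row


def build_pipeline(pipeline, archives):
    """Attach the steps to `pipeline`, with a Reshuffle after the fan-out."""
    # Imported lazily so expand_archive can be used without Beam installed.
    import apache_beam as beam

    return (
        pipeline
        | "CreateArchives" >> beam.Create(archives)
        | "ExpandArchives" >> beam.FlatMap(expand_archive)
        # Reshuffle prevents the steps before and after it from being fused,
        # so the expanded rows can be redistributed across workers.
        | "BreakFusion" >> beam.Reshuffle()
        | "CleanRows" >> beam.Map(lambda kv: {"source": kv[0], "fields": kv[1]})
    )
```

With Beam installed, `build_pipeline(beam.Pipeline(), archives)` would attach
these steps; the point is only where the fusion break sits, right after the
step that turns a few large records into many small ones.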

On Mon, Jun 5, 2017 at 12:13 PM Prabeesh K. <prabsmails@gmail.com> wrote:

> Please try using *--worker_machine_type* n1-standard-4 or n1-standard-8
>
> On 5 June 2017 at 23:08, Morand, Sebastien <sebastien.morand@veolia.com>
> wrote:
>
>> I have a problem with my attempts to test scaling on Dataflow.
>>
>> My pipeline is pretty simple: I get a list of files from Pub/Sub, so the
>> number of files I'm going to feed into the flow is known at the
>> beginning. Let's say I have 200 files containing about 20,000,000
>> records. Here are my steps:
>>
>>    - *First step:* Read file contents from storage: the files are .tar.gz
>>    archives, each containing 4 CSV files. I return the whole archive
>>    content as a single record.
>>    *OUT:* 200 records (one per archive, containing the data of all 4
>>    files). Basically it's a dict: {file1: content_of_file1, file2:
>>    content_of_file2, etc...}
>>
>>    - *Second step:* Join the data of the 4 files into one record (the
>>    main file contains foreign keys to get information from the other
>>    files).
>>    *OUT:* 20,000,000 records, one for every line in the files. Each
>>    record is a list of strings.
>>
>>    - *Third step:* Clean the data (convert values to prepare for loading
>>    into BigQuery) and turn each record into a dict whose keys are the
>>    BigQuery column names.
>>    *OUT:* 20,000,000 records, each a dict.
>>
>>    - *Fourth step:* Insert into BigQuery.
>>
>> So the first step returns 200 records, but I have 20,000,000 records to
>> insert.
>> This takes about an hour and a half and always uses 1 worker ...
>>
>> If I manually set the number of workers, it's not really faster. So for
>> an unknown reason, it doesn't scale. Any ideas how to make it scale?
>>
>> Thanks for any help.
>>
>> *Sébastien MORAND*
>> Team Lead Solution Architect
>> Technology & Operations / Digital Factory
>> Veolia - Group Information Systems & Technology (IS&T)
>> Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08
>> Bureau 0144C (Ouest)
>> 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France
>> *www.veolia.com <http://www.veolia.com>*
>>
>
>
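
For the `--worker_machine_type` suggestion quoted above, here is a small
sketch of assembling the worker-related flags in Python. The helper name
`dataflow_worker_args` and its defaults are hypothetical; the flags
themselves (`--runner`, `--worker_machine_type`, `--num_workers`,
`--max_num_workers`) are standard Dataflow pipeline options:

```python
def dataflow_worker_args(machine_type="n1-standard-4",
                         num_workers=None,
                         max_num_workers=None):
    """Build a list of command-line flags for a Dataflow job's workers."""
    args = [
        "--runner=DataflowRunner",
        "--worker_machine_type={}".format(machine_type),
    ]
    # Only pass worker counts when explicitly requested, so the service
    # defaults apply otherwise.
    if num_workers is not None:
        args.append("--num_workers={}".format(num_workers))
    if max_num_workers is not None:
        args.append("--max_num_workers={}".format(max_num_workers))
    return args
```

The resulting list can be passed to `PipelineOptions` from
`apache_beam.options.pipeline_options`. A bigger machine type or more workers
may not help on its own if the pipeline is limited by fusion rather than by
worker capacity.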
