beam-user mailing list archives

From OrielResearch Eila Arich-Landkof <e...@orielresearch.org>
Subject Re: Advice for piping many CSVs with different columns names to one bigQuery table
Date Thu, 27 Sep 2018 19:35:24 GMT
Thank you!
Probably around 50.

Best,
Eila

On Thu, Sep 27, 2018 at 1:23 AM Ankur Goenka <goenka@google.com> wrote:

> Hi Eila,
>
> That seems reasonable to me.
>
> Here is a reference on writing to BQ
> https://github.com/apache/beam/blob/1ffba44f7459307f5a134b8f4ea47ddc5ca8affc/sdks/python/apache_beam/examples/complete/game/leader_board.py#L326
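>
> As a quick illustration, a minimal sketch of the write step (the table
> name and schema below are placeholders, not from your project):
>
>   import apache_beam as beam
>
>   with beam.Pipeline() as p:
>       rows = p | beam.Create([{'sample_id': 's1', 'gene_a': 0.5}])
>       _ = rows | beam.io.WriteToBigQuery(
>           'my-project:my_dataset.samples',
>           schema='sample_id:STRING,gene_a:FLOAT',
>           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
>           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)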
>
> May I know how many distinct columns you are expecting across all files?
>
>
> On Wed, Sep 26, 2018 at 8:06 PM OrielResearch Eila Arich-Landkof <
> eila@orielresearch.org> wrote:
>
>> Hi Ankur / users,
>>
>> I would like to make sure that the suggested pipeline can work for my
>> needs.
>>
>> So, additional clarification:
>> - The CSV files have a few common and a few different columns. Each CSV
>> file represents one sample's measurement record.
>>
>> - When the CSVs are merged together, I expect to have one table with the
>> combined columns from all samples. Will your suggested pipeline allow that?
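>> (For example, if one file has columns a,b and another has b,c, the merged
>> table should have columns a,b,c, with empty/NULL cells where a sample has
>> no value for a column.)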
>>
>> - If I understand the following pipeline correctly:
>>
>> *Read files => Parse lines => Generate pCollections for each column =>
>> GroupBy column name => Write to BQ*
>>
>> *Read files:* will generate a pCollection element from each CSV line (= file)
>> *Parse lines:* will extract each column name and its matching value
>> (generating a tuple)
>> *Generate pCollection for each column:* will generate a pCollection from
>> the tuples
>> *GroupBy:* will merge each column name with all the relevant samples'
>> values (does it need to know the column names to group by, or will it
>> automatically use the tuple key?)
>> *WriteToBQ:* will NOT (is that true?) be able to match the values to the
>> relevant sample. A sample that didn't have a value for a specific column
>> would end up with values from other samples that are present in that
>> column's grouped collection.
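>>
>> To make steps 2-4 concrete, a rough sketch of what I have in mind (the
>> parse helper is made up, and it assumes each file arrives as a
>> (filename, contents) pair with line 1 = header and line 2 = values):
>>
>>   def to_column_tuples(file_and_contents):
>>       fname, contents = file_and_contents
>>       header, values = contents.splitlines()[:2]
>>       for col, val in zip(header.split(','), values.split(',')):
>>           yield (col, (fname, val))
>>
>>   columns = (files
>>              | beam.FlatMap(to_column_tuples)
>>              | beam.GroupByKey())   # -> (column, [(sample, value), ...])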
>>
>> - If the above understanding is correct, would extracting & merging the
>> headers from all 2.5M CSV files in advance to derive the schema, and then
>> using an additional pipeline to populate the data into BQ with that schema,
>> be the "right" approach?
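>>
>> i.e., something along these lines for the first pass (read_header is a
>> hypothetical helper, csv_paths is assumed to be the list of file paths,
>> and all columns are typed as STRING just for illustration):
>>
>>   import apache_beam as beam
>>   from apache_beam.io.filesystems import FileSystems
>>
>>   def read_header(path):
>>       # read only the first line (the header) of one CSV
>>       with FileSystems.open(path) as f:
>>           return f.readline().decode().rstrip().split(',')
>>
>>   schema = (p
>>             | beam.Create(csv_paths)
>>             | beam.FlatMap(read_header)      # one element per column name
>>             | beam.Distinct()
>>             | beam.combiners.ToList()
>>             | beam.Map(lambda cols: ','.join(
>>                   '%s:STRING' % c for c in sorted(cols))))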
>>
>> Please let me know if I am missing something here / what your thoughts are.
>> Many thanks,
>> Eila
>>
>> On Wed, Sep 26, 2018 at 12:04 PM OrielResearch Eila Arich-Landkof <
>> eila@orielresearch.org> wrote:
>>
>>> Hi Ankur,
>>>
>>> Thank you. Trying this approach now. Will let you know if I have any
>>> issue implementing it.
>>> Best,
>>> Eila
>>>
>>> On Tue, Sep 25, 2018 at 7:19 PM Ankur Goenka <goenka@google.com> wrote:
>>>
>>>> Hi Eila,
>>>>
>>>> If I understand correctly, the objective is to read a large number of
>>>> CSV files, each of which contains a single row with multiple columns,
>>>> deduplicate the columns across the files, and write them to BQ.
>>>> You are using a pandas DF to deduplicate the columns for a small set of
>>>> files, which might not work for a large number of files.
>>>>
>>>> You can use a Beam GroupBy to deduplicate the columns and write them to
>>>> BigQuery. Beam is capable of reading and managing a large number of files
>>>> when given a path/glob pattern for the directory containing those files.
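>>>>
>>>> e.g. (the bucket/pattern below is a placeholder; ReadFromText accepts
>>>> glob patterns):
>>>>
>>>>   lines = p | beam.io.ReadFromText('gs://my-bucket/csv-dir/*.csv')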
>>>> So the approach would be ->
>>>> Read files => Parse lines => Generate pCollections for each column =>
>>>> GroupBy column name => Write to BQ
>>>> For reference, here is an example of reading a file and doing a groupby:
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py
>>>>
>>>> Note: I am not very familiar with BQ, so I can't think of any direct
>>>> approach to dump data to BQ.
>>>>
>>>> Thanks,
>>>> Ankur
>>>>
>>>>
>>>> On Tue, Sep 25, 2018 at 12:13 PM OrielResearch Eila Arich-Landkof <
>>>> eila@orielresearch.org> wrote:
>>>>
>>>>> Hello,
>>>>> I would like to write a large number of CSV files to BQ, where the
>>>>> headers from all of them are aggregated into one common header. Any
>>>>> advice is very appreciated.
>>>>>
>>>>> The details are:
>>>>> 1. 2.5M CSV files
>>>>> 2. Each CSV file: header of 50-60 columns
>>>>> 3. Each CSV file: one data row
>>>>>
>>>>> There are common columns between the CSV files, but I don't know them in
>>>>> advance. I would like to have all the CSV files in one BigQuery table.
>>>>>
>>>>> My current method:
>>>>> When there was a smaller number of files, I read the CSV files and
>>>>> appended them to one pandas dataframe that was written to a file
>>>>> (total.csv). total.csv was the input to the Beam pipeline.
>>>>>
>>>>> small CSVs => Pandas DF => total CSV => pCollection => BigQuery
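>>>>>
>>>>> (roughly, for the small case; the path below is a placeholder:
>>>>>
>>>>>   import glob
>>>>>   import pandas as pd
>>>>>
>>>>>   df = pd.concat(pd.read_csv(f) for f in glob.glob('csvs/*.csv'))
>>>>>   df.to_csv('total.csv', index=False)
>>>>>
>>>>> which holds everything in memory at once)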
>>>>>
>>>>> The challenge with that approach is that pandas would require a large
>>>>> amount of memory to hold the 2.5M CSV files before writing them to BQ.
>>>>>
>>>>> Is there a different way to pipe the CSVs to BQ? One option would be to
>>>>> split the CSVs into batches and write them to different BQ tables, or
>>>>> append them to one table.
>>>>>
>>>>> Any thoughts on how to do this without extra coding?
>>>>>
>>>>> Many thanks,
>>>>> --
>>>>> Eila
>>>>> www.orielresearch.org
>>>>> https://www.meetup.com/Deep-Learning-In-Production/
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> Eila
>>> www.orielresearch.org
>>> https://www.meetup.com/Deep-Learning-In-Production/
>>>
>>>
>>>
>>
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>>
>>

-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/
