beam-user mailing list archives

From Ankur Goenka <goe...@google.com>
Subject Re: Advice for piping many CSVs with different columns names to one bigQuery table
Date Thu, 27 Sep 2018 05:22:41 GMT
Hi Eila,

That seems reasonable to me.

Here is a reference on writing to BQ
https://github.com/apache/beam/blob/1ffba44f7459307f5a134b8f4ea47ddc5ca8affc/sdks/python/apache_beam/examples/complete/game/leader_board.py#L326
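
For a rough idea of the shape of the write step, something like the following
(an untested sketch; the table reference and schema string are placeholders
you would fill in):

import apache_beam as beam

with beam.Pipeline() as p:
    # Each element is a dict keyed by column name; columns missing from a
    # given row become NULLs for NULLABLE fields in BQ.
    rows = p | beam.Create([{'sample_id': 's1', 'geneA': '0.5'}])
    rows | beam.io.WriteToBigQuery(
        'my-project:my_dataset.my_table',          # placeholder table reference
        schema='sample_id:STRING,geneA:STRING',    # placeholder schema
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)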

May I know how many distinct columns you are expecting across all files?


On Wed, Sep 26, 2018 at 8:06 PM OrielResearch Eila Arich-Landkof <
eila@orielresearch.org> wrote:

> Hi Ankur / users,
>
> I would like to make sure that the suggested pipeline can work for my
> needs.
>
> So, additional clarification:
> - The CSV files have a few common and a few different columns. Each CSV file
> represents one sample's measurement record.
>
> - When the CSVs are merged together, I expect to have one table with the
> combined columns from all samples. Will your suggested pipeline allow that?
>
> - If I understand the following pipeline correctly:
>
> *Read files => Parse lines => Generate pCollections for each column =>
> GroupBy column name => Write to BQ*
>
> *Read files:* will generate a pCollection from each CSV line (= file)
> *Parse Lines:* will extract the column name and its matching value
> (generating a tuple)
> *Generate pCollection for each column:* will generate a pCollection from
> the tuples
> *GroupBy:* will merge each column name with all the relevant sample
> values (does it need to know the column names to group by, or will it
> automatically use the tuple key? see the small sketch below)
> *WriteToBQ*: will NOT (is that true?) be able to write the values
> matched to the relevant sample. A sample that didn't have a value for a
> specific key would end up with values from other samples that are
> available in that column's grouped record.
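>
> A small, untested sketch of my mental model for that GroupBy step (the
> column and sample names are just placeholders):
>
> import apache_beam as beam
>
> with beam.Pipeline() as p:
>     cols = p | beam.Create([('geneA', ('sample1', '0.5')),
>                             ('geneB', ('sample1', '1.2')),
>                             ('geneA', ('sample2', '0.7'))])
>     # GroupByKey uses the first element of each tuple as the key automatically.
>     grouped = cols | beam.GroupByKey()
>     grouped | beam.Map(lambda kv: print(kv[0], list(kv[1])))
>     # e.g. geneA [('sample1', '0.5'), ('sample2', '0.7')]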
>
> - If the above understanding is correct, would extracting & merging the
> headers from all 2.5M CSV files in advance to derive the schema, and then
> using an additional pipeline to populate the data to BQ with that schema,
> be the "right" approach?
>
> Please let me know if I am missing something here / what your thoughts are.
> Many thanks,
> Eila
>
> On Wed, Sep 26, 2018 at 12:04 PM OrielResearch Eila Arich-Landkof <
> eila@orielresearch.org> wrote:
>
>> Hi Ankur,
>>
>> Thank you. Trying this approach now. Will let you know if I have any
>> issues implementing it.
>> Best,
>> Eila
>>
>> On Tue, Sep 25, 2018 at 7:19 PM Ankur Goenka <goenka@google.com> wrote:
>>
>>> Hi Eila,
>>>
>>> If I understand correctly, the objective is to read a large number of
>>> CSV files, each of which contains a single row with multiple columns,
>>> deduplicate the columns across the files, and write them to BQ.
>>> You are using a pandas DF to deduplicate the columns, which works for a
>>> small set of files but might not work for a large number of files.
>>>
>>> You can use Beam's GroupBy to deduplicate the columns and write them to
>>> BigQuery. Beam is capable of reading and managing a large number of files
>>> when you provide a path to the directory containing those files.
>>> So the approach would be ->
>>> Read files => Parse lines => Generate pCollections for each column =>
>>> GroupBy column name => Write to BQ
>>> For reference, here is an example of reading files and doing a GroupBy:
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py
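>>>
>>> A very rough, untested sketch of that shape (the file pattern and the use
>>> of the fileio module are assumptions on my side, and each file is assumed
>>> to be exactly one header line plus one data line):
>>>
>>> import csv
>>> import apache_beam as beam
>>> from apache_beam.io import fileio
>>>
>>> def file_to_columns(readable_file):
>>>     # Emit (column_name, (file_path, value)) tuples for one CSV file.
>>>     lines = readable_file.read_utf8().splitlines()
>>>     header = next(csv.reader([lines[0]]))
>>>     row = next(csv.reader([lines[1]]))
>>>     for name, value in zip(header, row):
>>>         yield name, (readable_file.metadata.path, value)
>>>
>>> with beam.Pipeline() as p:
>>>     (p
>>>      | fileio.MatchFiles('gs://my-bucket/csvs/*.csv')   # placeholder pattern
>>>      | fileio.ReadMatches()
>>>      | beam.FlatMap(file_to_columns)
>>>      | beam.GroupByKey()
>>>      | beam.Map(print))  # replace with the BigQuery write once the schema is known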
>>>
>>> Note: I am not very familiar with BQ, so I can't think of any direct
>>> approach to dump data to BQ.
>>>
>>> Thanks,
>>> Ankur
>>>
>>>
>>> On Tue, Sep 25, 2018 at 12:13 PM OrielResearch Eila Arich-Landkof <
>>> eila@orielresearch.org> wrote:
>>>
>>>> Hello,
>>>> I would like to write a large number of CSV files to BQ, where the
>>>> headers from all of them are aggregated into one common header. Any
>>>> advice is very much appreciated.
>>>>
>>>> The details are:
>>>> 1. 2.5M CSV files
>>>> 2. Each CSV file: header of 50-60 columns
>>>> 3. Each CSV file: one data row
>>>>
>>>> There are common columns between the CSV files, but I don't know them in
>>>> advance. I would like to have all the CSV files in one BigQuery table.
>>>>
>>>> My current method:
>>>> When there was a smaller number of files, I read the CSV files and
>>>> appended them to one pandas dataframe that was written to a file
>>>> (total.csv). total.csv was the input to the Beam pipeline.
>>>>
>>>> small CSVs => Pandas DF => total CSV => pCollection => Big Query
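>>>>
>>>> (Roughly what I do today, with a placeholder path; pandas unions the
>>>> differing column sets when concatenating:)
>>>>
>>>> import glob
>>>> import pandas as pd
>>>>
>>>> frames = [pd.read_csv(f) for f in glob.glob('csvs/*.csv')]
>>>> pd.concat(frames, ignore_index=True, sort=False).to_csv('total.csv', index=False)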
>>>>
>>>> The challenge with that approach is that pandas will require a large
>>>> amount of memory to hold the 2.5M CSV files before writing them to BQ.
>>>>
>>>> Is there a different way to pipe the CSVs to BQ? One option would be
>>>> to split the CSVs into batches and write them to different BQ tables, or
>>>> append them to one table.
>>>>
>>>> Any thoughts on how to do it without extra coding?
>>>>
>>>> Many thanks,
>>>> --
>>>> Eila
>>>> www.orielresearch.org
>>>> https://www.meetup.com/Deep-Learning-In-Production/
>>>>
>>>>
>>>>
>>
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>>
>>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
>
>
>
