beam-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From OrielResearch Eila Arich-Landkof <e...@orielresearch.org>
Subject Re: Advice for piping many CSVs with different columns names to one bigQuery table
Date Thu, 27 Sep 2018 03:05:54 GMT
Hi Ankur / users,

I would like to make sure that the suggested pipeline can work for my
needs.

So, additional clarification:
- The CSV files have few common and few different columns. Each CSV file
represent a sample measurements record.

- When the CSVs merged together, I expect to have one table with combined
columns from all samples. Will your suggested pipeline allow that?

- If I understand correctly the following pipeline:

*Read files => Parse lines => Generate pCollections for each column =>
GroupBy column name => Write to BQ*

*Read files:* will generate a pCollection from each CSV line (=file)
*Parse Lines*: will extract the column name and its matching value
(generate a tuple)
*Generate pCollection for each column:* will generate a pCollection from
the tuple
*GroupBy: *will merge each column name with all the relevant samples values
(does it need to know the column names to group by or will it automatically
use the tuple key? )
*WriteToBQ*: will NOT (is that true?) be able to write the values matching
the relevant sample. The samples that didnt have value for a specific key
will get values for other samples values that are available at the column's
collection record

- If the above understanding correct, will extracting & merging the headers
from all 2.5M CSV files in advance to derived the schema and then using an
additional pipeline to populate the data to BQ with schema will be the
"right" approach?

Please let me know if I miss something here / what your thoughts are
Many thanks,
Eila

On Wed, Sep 26, 2018 at 12:04 PM OrielResearch Eila Arich-Landkof <
eila@orielresearch.org> wrote:

> Hi Ankur,
>
> Thank you. Trying this approach now. Will let you know if I have any issue
> implementing it.
> Best,
> Eila
>
> On Tue, Sep 25, 2018 at 7:19 PM Ankur Goenka <goenka@google.com> wrote:
>
>> Hi Eila,
>>
>> If I understand correctly, the objective is to read a large number of CSV
>> files, each of which contains a single row with multiple columns.
>> Deduplicate the columns in the file and write them to BQ.
>> You are using pandas DF to deduplicate the columns for a small set of
>> files which might not work for large number of files.
>>
>> You can use beam groupBy to deduplicate the columns and write them to
>> bigquery. Beam is capable of reading and managing large number of files by
>> providing path to the directory containing those files.
>> So the approach would be ->
>> Read files => Parse lines => Generate pCollections for each column =>
>> GroupBy column name => Write to BQ
>> For reference here is an example of reading file and doing groupby
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py
>>
>> Note: I am not very familiar with BQ so can't think of any direct
>> approach to dump data to BQ.
>>
>> Thanks,
>> Ankur
>>
>>
>> On Tue, Sep 25, 2018 at 12:13 PM OrielResearch Eila Arich-Landkof <
>> eila@orielresearch.org> wrote:
>>
>>> Hello,
>>> I would like to write large number of CSV file to BQ where the headers
>>> from all of them is aggregated to one common headers. any advice is very
>>> appreciated.
>>>
>>> The details are:
>>> 1. 2.5M CSV files
>>> 2. Each CSV file: header of 50-60 columns
>>> 2. Each CSV file: one data row
>>>
>>> there are common columns between the CSV file but I dont know them in
>>> advance.I would like to have all the csv files in one bigQuery table.
>>>
>>> My current method:
>>> When it was smaller amount of files, I read the csv files and appended
>>> them to one pandas dataframe that was written to a file (total.csv).
>>> total.csv was the input to the beam pipeline.
>>>
>>> small CSVs => Pandas DF => total CSV => pCollection => Big Query
>>>
>>> The challenge with that approach is that the pandas will require large
>>> memory in order to hold the 2.5M csv files before writing them to BQ.
>>>
>>> Is there a different way to pipe the CSVs to BQ? One option will be to
>>> split the CSVs to batchs and write them to different BQ tables or append to
>>> one table.
>>>
>>> Any thoughts how to do it without extra coding?
>>>
>>> Many thanks,
>>> --
>>> Eila
>>> www.orielresearch.org
>>> https://www.meetu <https://www.meetup.com/Deep-Learning-In-Production/>
>>> p.co <https://www.meetup.com/Deep-Learning-In-Production/>
>>> m/Deep-Learning-In-Production/
>>> <https://www.meetup.com/Deep-Learning-In-Production/>
>>>
>>>
>>>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetu <https://www.meetup.com/Deep-Learning-In-Production/>
> p.co <https://www.meetup.com/Deep-Learning-In-Production/>
> m/Deep-Learning-In-Production/
> <https://www.meetup.com/Deep-Learning-In-Production/>
>
>
>

-- 
Eila
www.orielresearch.org
https://www.meetu <https://www.meetup.com/Deep-Learning-In-Production/>p.co
<https://www.meetup.com/Deep-Learning-In-Production/>
m/Deep-Learning-In-Production/
<https://www.meetup.com/Deep-Learning-In-Production/>

Mime
View raw message