beam-user mailing list archives

From Joe Cullen <joe.m.cullen1...@gmail.com>
Subject Re: Inferring Csv Schemas
Date Fri, 30 Nov 2018 14:01:31 GMT
Thanks Reza, that's really helpful!

I have a few questions:

"He used a GroupByKey function on the JSON type and then a manual check on
the JSON schema against the known good BigQuery schema. If there was a
difference, the schema would mutate and the updates would be pushed
through."

If the difference is that a new column has been added to the JSON elements,
does there need to be any mutation? The JSON schema derived from the JSON
elements would already include this new column, and if BigQuery allows
additive schema changes then this new JSON schema should be fine, right?

But then I'm not sure how the pipeline would ever reach the 'failed inserts'
section (as the insert should have been successful).

Have I misunderstood what is being mutated?
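
For what it's worth, the comparison being discussed can be sketched roughly like this: infer a flat schema from an incoming JSON record and diff it against the known BigQuery schema to spot additive changes. This is only an illustrative sketch, not the code from the blog post; the function names and the type mapping are assumptions.

```python
# Illustrative sketch: detect additive schema changes by comparing a schema
# inferred from a JSON record against the known BigQuery schema.
# Names and the Python-to-BigQuery type mapping are assumptions.

def infer_fields(record):
    """Infer a flat {field name: BigQuery type} mapping from one JSON record."""
    type_map = {str: "STRING", int: "INTEGER", float: "FLOAT", bool: "BOOLEAN"}
    return {name: type_map.get(type(value), "STRING")
            for name, value in record.items()}

def new_fields(known_schema, record):
    """Return fields present in the record but missing from the known schema."""
    inferred = infer_fields(record)
    return {name: bq_type for name, bq_type in inferred.items()
            if name not in known_schema}

# Example: one new column appears in the incoming data.
known = {"id": "INTEGER", "name": "STRING"}
record = {"id": 1, "name": "a", "added_col": 2.5}
print(new_fields(known, record))  # {'added_col': 'FLOAT'}
```

If `new_fields` returns anything, the known schema would be extended with those fields before (re-)inserting, which is the "mutation" step as I understand it.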

Thanks,
Joe

On Fri, 30 Nov 2018, 11:07 Reza Ardeshir Rokni <rarokni@gmail.com> wrote:

> Hi Joe,
>
> You may find some of the info in this blog post of interest; it's based on
> streaming pipelines, but the ideas are useful.
>
>
> https://cloud.google.com/blog/products/gcp/how-to-handle-mutating-json-schemas-in-a-streaming-pipeline-with-square-enix
>
> Cheers
>
> Reza
>
> On Thu, 29 Nov 2018 at 06:53, Joe Cullen <joe.m.cullen1990@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I have a pipeline reading CSV files, performing some transforms, and
>> writing to BigQuery. At the moment I'm reading the BigQuery schema from a
>> separate JSON file. If the CSV files had a new column added (and I wanted
>> to include this column in the resultant BigQuery table), I'd have to change
>> the JSON schema or the pipeline itself. Is there any way to autodetect the
>> schema using BigQueryIO? How do people normally deal with potential changes
>> to input CSVs?
>>
>> Thanks,
>> Joe
>>
>
