beam-commits mailing list archives

From "Eugene Kirpichov (JIRA)" <>
Subject [jira] [Commented] (BEAM-2993) AvroIO.write without specifying a schema
Date Wed, 04 Oct 2017 22:06:00 GMT


Eugene Kirpichov commented on BEAM-2993:

OK, thanks for the explanations. A couple more questions:

- Does AvroIO.write().to(DynamicDestinations) work for you? It seems like what you have is
a very specialized use case (I've never seen nor imagined anything like it), so if an existing
solution does the job, then it might be best to just use that rather than develop a new feature
guided only by a single very exotic use case.
- Suppose a schemaless AvroIO.write() was implemented, and suppose you give it a PCollection<GenericRecord>
that happens to contain records with many different schemas. What should it do? Should it
group them by schema? Should it simply fail? Should it use the schema of a (non-deterministically
chosen) "first" record in each generated file and hope that other records have the same schema?
- Would it make things easier if, instead of PCollection<IndexedRecord>, you operated
in terms of PCollection<SchemaRefAndRecord> where SchemaRefAndRecord is your custom
type { String schemaURI; GenericRecord record; }, with a custom coder for it that fetches
the schema over the network from a schema registry by URI or something? And then when writing
to AvroIO, you'd go down the path of DynamicDestinations and group by schemaURI before writing
(i.e. use it as a destination type); and it would be up to your code to ensure that the schema
URIs are unique.
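
To make the last suggestion concrete, here is a minimal, dependency-free sketch of the "group by schemaURI before writing" idea. It is not Beam or AvroIO code: the class name SchemaGroupingSketch, the method groupBySchemaURI, and the example URIs are all hypothetical, and a plain String stands in for GenericRecord so the example runs without Avro on the classpath. In a real pipeline the schemaURI would presumably play the role of the destination type in DynamicDestinations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaGroupingSketch {

    /** Hypothetical pairing of a record with a reference to its schema. */
    static final class SchemaRefAndRecord {
        final String schemaURI;
        // Assumption: a String stands in for GenericRecord to keep the sketch self-contained.
        final String record;

        SchemaRefAndRecord(String schemaURI, String record) {
            this.schemaURI = schemaURI;
            this.record = record;
        }
    }

    /**
     * Buckets records by schemaURI so that each bucket could be written to
     * its own file with a single, known schema. URIs keep first-seen order.
     */
    static Map<String, List<SchemaRefAndRecord>> groupBySchemaURI(List<SchemaRefAndRecord> input) {
        Map<String, List<SchemaRefAndRecord>> groups = new LinkedHashMap<>();
        for (SchemaRefAndRecord r : input) {
            groups.computeIfAbsent(r.schemaURI, uri -> new ArrayList<>()).add(r);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up URIs; it is up to the caller to ensure they are unique per schema.
        List<SchemaRefAndRecord> input = Arrays.asList(
            new SchemaRefAndRecord("registry://user/v1", "alice"),
            new SchemaRefAndRecord("registry://order/v1", "order-17"),
            new SchemaRefAndRecord("registry://user/v1", "bob"));

        for (Map.Entry<String, List<SchemaRefAndRecord>> e : groupBySchemaURI(input).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " record(s)");
        }
    }
}
```

Grouping on the URI rather than on the schema object itself keeps the key small and cheap to compare, at the cost of trusting that two records with the same URI really do share a schema.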

> AvroIO.write without specifying a schema
> ----------------------------------------
>                 Key: BEAM-2993
>                 URL:
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
> Similarly to, we should be able to write to Avro files using {{AvroIO}} without specifying
> a schema at build time. Consider the following use case: a user has a {{PCollection<GenericRecord>}}
> but the schema is only known while running the pipeline. {{AvroIO.writeGenericRecords}} needs
> the schema, but the schema is already available in {{GenericRecord}}. We should be able to
> call {{AvroIO.writeGenericRecords()}} with no schema.

This message was sent by Atlassian JIRA
