beam-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ryan Skraba (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-2993) AvroIO.write without specifying a schema
Date Wed, 04 Oct 2017 10:13:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191085#comment-16191085
] 

Ryan Skraba commented on BEAM-2993:
-----------------------------------

Hello -- I'm chiming in to help clarify our use case, which is a bit specialized.  However,
if it's useful for us, it's potentially useful to others!

As part of our work using Beam, we help users assemble pipelines to run using configured "components".
 These are eventually translated to PTransforms, of course, acting on PCollections -- nothing
surprising!   We've picked Avro IndexedRecords (not GenericRecords, but that's a detail for
the moment) as the common currency between the PTransforms.  This works well, especially if
you know every schema on every collection at design-time, when you're building your pipeline.
  

[~jkff] is correct that *if* the schema is known  *and* you are using {{AvroCoder}}, you already
have the schema in your hands when you build the {{AvroIO.write}} and all is well.

We have some advanced functionality, however, where we deduce the schema at runtime -- either
at the start of the pipeline (such as reading from a JDBC table and converting the row + table
metadata into a consistent IndexedRecord) but also in the middle of a pipeline (we can infer
a schema after some user-defined processing).  In brief, we can't directly use AvroCoder in
this case, but we can write our own {{Coder<IndexedRecord>}} that takes care of sharing
the schema between interested nodes when necessary (not with every record).

In this case, we've managed to create a {{PCollection<IndexedRecord>}} while designing
the Pipeline, using our Coder that doesn't require the schema, but we still can't attach it
to the {{AvroIO.write}}...

Note that "sharing the schema between interested nodes" in our custom coder introduces a distributed
state between nodes, which is not the ideal for parallelization -- in this case, we've measured
the cost to be acceptable since it only occurs the first time a node tries to write or read
an avro-encoded record from the collection.

That's a long detour to explain why {{AvroIO.write}} without schema would be interesting to
us, but I hope you find it useful.   Our technique for sharing the schema as distributed state
in the Coder is a much larger view but I'm very sure we'd be interested in contributing!

> AvroIO.write without specifying a schema
> ----------------------------------------
>
>                 Key: BEAM-2993
>                 URL: https://issues.apache.org/jira/browse/BEAM-2993
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>
> Similarly to https://issues.apache.org/jira/browse/BEAM-2677, we should be able to write
to avro files using {{AvroIO}} without specifying a schema at build time. Consider the following
use case: a user has a {{PCollection<GenericRecord>}}  but the schema is only known
while running the pipeline.  {{AvroIO.writeGenericRecords}} needs the schema, but the schema
is already available in {{GenericRecord}}. We should be able to call {{AvroIO.writeGenericRecords()}}
with no schema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message