beam-commits mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-2993) AvroIO.write without specifying a schema
Date Thu, 05 Oct 2017 13:52:02 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192901#comment-16192901
] 

ASF GitHub Bot commented on BEAM-2993:
--------------------------------------

GitHub user echauchot opened a pull request:

    https://github.com/apache/beam/pull/3950

     [BEAM-2993] AvroIO.write without specifying a schema

    Follow this checklist to help us incorporate your contribution quickly and easily:
    
     - [X] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/)
filed for the change (usually before you start working on it).  Trivial changes like typos
do not require a JIRA issue.  Your pull request should address just this issue, without pulling
in other changes.
     - [X] Each commit in the pull request should have a meaningful subject line and body.
     - [X] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`,
where you replace `BEAM-XXX` with the appropriate JIRA issue.
     - [X] Write a pull request description that is detailed enough to understand what the
pull request does, how, and why.
     - [X] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will
be performed on your pull request automatically.
     - [X] If this contribution is large, please file an Apache [Individual Contributor License
Agreement](https://www.apache.org/licenses/icla.pdf).
    
    ---
    This PR adds the ability to use `AvroIO.write()` and related methods without specifying
a schema. 
    The schema is determined at the first call of `AvroSink.write()`: the `DataFileWriter`
is lazily initialized (at the first write), once we have a value to get the schema from.  
    This PR also makes the schema optional in `ConstantAvroDestination` and deprecates the write
methods that take a schema as a parameter. Tell me if I'm missing something that prevents deprecation
of these methods.
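
A minimal sketch of the lazy-initialization idea, using plain Java stand-ins rather than the actual Beam or Avro classes. `Rec`, `FileWriterStub`, and `LazySink` are hypothetical names introduced only for illustration; the real change lives in `AvroSink` and `DataFileWriter`, whose code is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone sketch of lazy writer initialization; none of these types are Beam or Avro classes. */
class LazyWriterSketch {

  /** Hypothetical stand-in for a GenericRecord: a value that carries its own schema. */
  static final class Rec {
    final String schema;
    final String payload;

    Rec(String schema, String payload) {
      this.schema = schema;
      this.payload = payload;
    }
  }

  /** Hypothetical stand-in for a file writer that needs the schema up front to emit a header. */
  static final class FileWriterStub {
    final List<String> lines = new ArrayList<>();

    FileWriterStub(String schema) {
      lines.add("header:" + schema);
    }

    void append(Rec r) {
      lines.add(r.payload);
    }
  }

  /** Sink that defers creating the writer until the first element arrives. */
  static final class LazySink {
    private FileWriterStub writer; // null until the first write

    void write(Rec value) {
      if (writer == null) {
        // First element: take the schema from the value itself.
        writer = new FileWriterStub(value.schema);
      }
      writer.append(value);
    }

    boolean isInitialized() {
      return writer != null;
    }

    List<String> output() {
      return writer == null ? new ArrayList<>() : writer.lines;
    }
  }

  public static void main(String[] args) {
    LazySink sink = new LazySink();
    sink.write(new Rec("User", "alice"));
    sink.write(new Rec("User", "bob"));
    System.out.println(sink.output()); // prints [header:User, alice, bob]
  }
}
```

The header is written exactly once, with the schema of the first element; later elements are appended without consulting their schema again.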
    
    To use `AvroIO.write()` with no schema, all the elements of the input PCollection must
have the same schema. This is already the case with the current `AvroIO.write(schema)` implementation,
because the schema instance is passed to the `TypedWrite` and then to the `ConstantAvroDestination`
that is used in `AvroSink`. Please tell me if I'm missing something here.
    
    My only concern is with empty bundles: `AvroSink.write()` will not be called, so
the `DataFileWriter` is never initialized.  
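
To make the empty-bundle concern concrete, here is a small plain-Java sketch. It is not the Beam implementation: `LazySink` is a hypothetical stand-in, the writer is modeled as a list of lines, and the null guard in `close()` is only one possible way to handle the case, not necessarily what this PR does.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone sketch of the empty-bundle case; these are not Beam or Avro classes. */
class EmptyBundleSketch {

  /** Hypothetical sink whose writer (modeled as a list of lines) is created on first write. */
  static final class LazySink {
    private List<String> writer; // stays null for an empty bundle

    void write(String schema, String payload) {
      if (writer == null) {
        writer = new ArrayList<>();
        writer.add("header:" + schema);
      }
      writer.add(payload);
    }

    /** Returns the written lines, or an empty list if no element ever arrived. */
    List<String> close() {
      // Guard against the never-initialized writer instead of dereferencing null.
      return writer == null ? new ArrayList<>() : writer;
    }
  }

  public static void main(String[] args) {
    LazySink empty = new LazySink();
    System.out.println(empty.close().isEmpty()); // prints true: no schema, so no header

    LazySink one = new LazySink();
    one.write("User", "alice");
    System.out.println(one.close()); // prints [header:User, alice]
  }
}
```

In this model an empty bundle yields no header at all, because no schema was ever seen.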
    
    Please merge the PR below before this one, because it is used as a base for the tests:
    https://github.com/apache/beam/pull/3948
    
    R: @jkff 
    R: @reuvenlax 
    CC: @lukecwik 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/echauchot/beam AvroIOWriteSchemaLess2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/3950.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3950
    
----
commit 43ef4d42d7d224b1997278832ec645bccb945792
Author: Etienne Chauchot <echauchot@gmail.com>
Date:   2017-10-05T09:45:12Z

    [BEAM-3019] Make AvroIOWriteTransformTest more generic
    
    make runTestWrite() more generic to be able to use GenericRecord[] as input for writeGenericRecords
test in place of AvroGeneratedUser
    make readAvroFile() generic to be able to read GenericRecords using GenericDatumReader
for writeGenericRecords test

commit 84074e36085d76f569c89d4a29a647fc40b22531
Author: Etienne Chauchot <echauchot@gmail.com>
Date:   2017-10-02T15:08:55Z

    [BEAM-2993] AvroIO.write without specifying a schema
    
    Lazy init (at first write) of the dataFileWriter once we have the value to get the schema
from.
    Make schema optional in ConstantAvroDestination and deprecate write methods that take
schema as parameter
    Cleaning

commit d19c2cb3538e5981e8138522d0c2138b455dec46
Author: Etienne Chauchot <echauchot@gmail.com>
Date:   2017-10-05T12:08:04Z

    Add tests of the schema less write methods
    Cleaning

commit da95342353bd191c55d6d7768d4c052c531b8cf1
Author: Etienne Chauchot <echauchot@gmail.com>
Date:   2017-10-05T12:43:29Z

    Fixups

----


> AvroIO.write without specifying a schema
> ----------------------------------------
>
>                 Key: BEAM-2993
>                 URL: https://issues.apache.org/jira/browse/BEAM-2993
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>
> Similarly to https://issues.apache.org/jira/browse/BEAM-2677, we should be able to write
to avro files using {{AvroIO}} without specifying a schema at build time. Consider the following
use case: a user has a {{PCollection<GenericRecord>}}  but the schema is only known
while running the pipeline.  {{AvroIO.writeGenericRecords}} needs the schema, but the schema
is already available in {{GenericRecord}}. We should be able to call {{AvroIO.writeGenericRecords()}}
with no schema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
