beam-commits mailing list archives

From "Andrea Pierleoni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema
Date Wed, 12 Jul 2017 06:11:00 GMT

    [ https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083496#comment-16083496 ]

Andrea Pierleoni commented on BEAM-2595:
----------------------------------------

Yes, sorry, I forgot the stack trace:

{code}
Traceback (most recent call last):
  File "/Users/andreap/work/code/library_dataflow/main.py", line 347, in <module>
    run()
  File "/Users/andreap/work/code/library_dataflow/main.py", line 343, in run
    write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/pvalue.py", line 100, in __or__
    return self.pipeline.apply(ptransform, self)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/pipeline.py", line 265, in apply
    pvalueish_result = self.runner.apply(transform, pvalueish)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 150, in apply
    return m(transform, input)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/runners/runner.py", line 156, in apply_PTransform
    return transform.expand(input)
  File "/Users/andreap/work/code/library_dataflow/beam2_1.py", line 213, in expand
    return pcoll | 'WriteToBigQuery' >> ParDo(bigquery_write_fn)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/transforms/core.py", line 620, in __init__
    super(ParDo, self).__init__(fn, *args, **kwargs)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/transforms/ptransform.py", line 515, in __init__
    self.fn = pickler.loads(pickler.dumps(self.fn))
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apache_beam/internal/pickler.py", line 225, in loads
    return dill.loads(s)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/dill/dill.py", line 277, in loads
    return load(file)
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1195, in load_appends
    list.extend(stack[mark + 1:])
  File "/Users/andreap/library_dataflow/venv/lib/python2.7/site-packages/apitools/base/protorpclite/messages.py", line 1147, in extend
    self.__field.validate(sequence)
AttributeError: 'FieldList' object has no attribute '_FieldList__field'
{code}

For some reason, once deserialized, the sequence no longer has the '_FieldList__field' attribute.
I don't believe the bug is introduced in 2.1.0 per se, but a (very useful) class that uses this code path ships
with 2.1.0, so it may surface as a real problem in production. It definitely is one for us.
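The failure mode in the trace can be reproduced without Beam or apitools. Pickle rebuilds a list subclass by creating an empty instance, replaying the stored items through its extend() method, and restoring the instance __dict__ only afterwards, so an extend() that depends on an attribute set in __init__ fails mid-load. A minimal sketch (the FieldList below is a hypothetical stand-in, not the real protorpclite class):

```python
import pickle

class FieldList(list):
    """Hypothetical stand-in for apitools' protorpclite FieldList:
    a list subclass whose extend() depends on a name-mangled
    attribute that only __init__ sets."""
    def __init__(self, field, sequence):
        self.__field = field  # stored as _FieldList__field
        super(FieldList, self).__init__(sequence)

    def extend(self, sequence):
        # Mirrors self.__field.validate(sequence): touches the
        # attribute before delegating to list.extend.
        _ = self.__field
        super(FieldList, self).extend(sequence)

fl = FieldList("author", ["a", "b", "c"])

# pickle recreates the object as an empty instance, replays the items
# through extend(), and restores __dict__ last -- so extend() runs
# before _FieldList__field exists and raises AttributeError.
try:
    pickle.loads(pickle.dumps(fl))
    error = None
except AttributeError as exc:
    error = str(exc)

print(error)  # 'FieldList' object has no attribute '_FieldList__field'
```

This is the same AttributeError as in the trace above, which suggests the problem is in how the schema object's container type interacts with the pickle protocol rather than in Beam itself.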

> WriteToBigQuery does not work with nested json schema
> -----------------------------------------------------
>
>                 Key: BEAM-2595
>                 URL: https://issues.apache.org/jira/browse/BEAM-2595
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py
>    Affects Versions: 2.1.0
>         Environment: mac os local runner, Python
>            Reporter: Andrea Pierleoni
>            Assignee: Sourabh Bajaj
>            Priority: Minor
>              Labels: gcp
>             Fix For: 2.1.0
>
>
> I am trying to use the new `WriteToBigQuery` PTransform added to `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1.
> I need to write to a BigQuery table with nested fields.
> The only way to specify nested schemas in BigQuery is with the JSON schema.
> None of the classes in `apache_beam.io.gcp.bigquery` can parse the JSON schema, but they accept a schema as an instance of the class `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`.
> I am composing the `TableFieldSchema` as suggested here [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436], and it looks fine when passed to the PTransform `WriteToBigQuery`.
> The problem is that the base class `PTransformWithSideInputs` tries to pickle and unpickle the function [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515] (which includes the TableFieldSchema instance), and for some reason, when the class is unpickled, some `FieldList` instances are converted to plain lists, and the unpickling validation fails.
> Would it be possible to extend the test coverage to nested JSON objects for BigQuery?
> They are also relatively easy to parse into a TableFieldSchema.
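The recursion the quoted description alludes to is indeed short: each JSON field maps to one schema field, and RECORD fields recurse into their "fields" list. A hedged sketch of the idea, using plain dicts as a stand-in for the real `TableFieldSchema` objects (the defaults and helper names here are illustrative, not the Beam API):

```python
# Hypothetical recursive parser from a BigQuery-style JSON schema to a
# nested structure; dicts stand in for bigquery.TableFieldSchema.
def parse_field(json_field):
    field = {
        "name": json_field["name"],
        "type": json_field.get("type", "STRING"),
        "mode": json_field.get("mode", "NULLABLE"),
    }
    if field["type"] == "RECORD":
        # Nested fields: recurse into the sub-schema.
        field["fields"] = [parse_field(f) for f in json_field["fields"]]
    return field

def parse_schema(json_schema):
    return [parse_field(f) for f in json_schema]

schema = parse_schema([
    {"name": "title", "type": "STRING"},
    {"name": "authors", "type": "RECORD", "mode": "REPEATED",
     "fields": [{"name": "name", "type": "STRING"}]},
])
```

With the real client classes, each dict would instead become a `TableFieldSchema` instance with the same attributes, as in the Stack Overflow answer linked in the issue.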



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
