beam-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anton Kedin <ke...@google.com>
Subject Complex Types Support for Beam SQL DDL
Date Fri, 04 May 2018 18:22:55 GMT
Hi,

I am working on adding support for non-primitive types in Beam SQL DDL.

*Goal*
Allow users to define tables with Rows, Arrays, Maps as field types in DDL.
This enables defining schemas for complex sources, e.g. describing JSON
sources or other sources which support complex field types (BQ, etc).

*Solution*
Extend the parser we have in Beam SQLto accept the following DDL statement:
"CREATE TABLE tableName (field_name <COMPLEX_FIELD_TYPE>)" where
"<COMPLEX_FIELD_TYPE>" can be any the following:

   - "primitiveType ARRAY", for example, "field_int_arr" INTEGER ARRAY".
   Thoughts:
   - this is how SQL standard defines ARRAY field declaration;
      - existing parser supports similar syntax for collections;
      - hard to read for nested collections;
      - similar syntax is supported in Postgres
      <https://www.postgresql.org/docs/9.1/static/arrays.html>;
   - "ARRAY<type>", for example "field_matrix ARRAY<ARRAY<INTEGER>>".
   Thoughts:
   - easy to read and support arbitrary nesting;
      - similar syntax is implemented in:
         - BigQuery
         <https://cloud.google.com/bigquery/docs/data-definition-language#column_name_and_column_schema>
         ;
         - Spanner
         <https://cloud.google.com/spanner/docs/data-definition-language#arrays>
         ;
         - KSQL
         <https://docs.confluent.io/current/ksql/docs/syntax-reference.html#create-table>
         ;
         - Spark/Hive
         <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable>
         ;
      - "MAP<primitiveType, type>", for example "MAP<VARCHAR, MAP<INTEGER,
   VARCHAR>>". Thoughts:
   - there doesn't seem to be a SQL standard support for maps;
      - looks similar to the "ARRAY<type>" definition;
      - similar syntax is implemented in:
         - KSQL
         <https://docs.confluent.io/current/ksql/docs/syntax-reference.html#create-table>
         ;
         - Spark/Hive
         <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable>
         ;
      - "ROW(fieldList)", for example "row_field ROW(f_int INTEGER, f_str
   VARCHAR)". Thoughts:
   - SQL standard defines the syntax this way;
      - don't know where similar syntax is implemented;
   - "ROW<fieldList>", for example "row_field ROW<f_int INTEGER, f_str
   VARCHAR>". Thoughts:
   - ROW is not supported in a lot of dialects, but STRUCT is similar and
      supported in few dialects;
      - similar syntax for STRUCT is implemented in:
         - BigQuery
         <https://cloud.google.com/bigquery/docs/data-definition-language>;
         - Spark/Hive
         <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable>
         ;

Questions/comments?
Pull Request <https://github.com/apache/beam/pull/5276>

Thank you,
Anton

Mime
View raw message