spark-issues mailing list archives

From "Simeon Simeonov (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-16483) Unifying struct fields and columns
Date Mon, 11 Jul 2016 19:58:11 GMT
Simeon Simeonov created SPARK-16483:
---------------------------------------

             Summary: Unifying struct fields and columns
                 Key: SPARK-16483
                 URL: https://issues.apache.org/jira/browse/SPARK-16483
             Project: Spark
          Issue Type: New Feature
          Components: SQL
            Reporter: Simeon Simeonov


This issue comes as a result of an exchange with Michael Armbrust outside of the usual JIRA/dev
list channels. 

DataFrame provides a full set of manipulation operations for top-level columns. They can
be added, removed, modified and renamed. The same is not true of fields inside structs.
Yet, from a logical standpoint, Spark users may very well want to perform the same operations
on struct fields, especially since automatic schema discovery from JSON input tends to create
deeply nested structs.

Common use-cases include:

- Remove and/or rename struct field(s) to adjust the schema
- Fix a data quality issue with a struct field (update/rewrite)

Doing this with the existing API requires manually calling {{named_struct}} and listing
all fields, including ones we don't want to manipulate. This leads to complex, fragile code
that cannot survive schema evolution.
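
As a rough illustration of that fragility, here is a minimal sketch in plain Python rather than Spark, with rows modeled as nested dicts. The function name and the field names (address, street, city, zip) are hypothetical and chosen only for the example. Rebuilding the struct by listing every field, as a hand-written {{named_struct}} call would, silently drops any field that shows up later in the schema.

```python
def rename_city_by_hand(row):
    """Rename address.city to address.town by rebuilding the struct
    field by field, the way a hand-written named_struct call would."""
    addr = row["address"]
    return {
        **row,
        "address": {
            "street": addr["street"],  # untouched, but must still be listed
            "town": addr["city"],      # the only field we wanted to change
        },
    }


# A row matching the schema the code was written against.
old_row = {"id": 1, "address": {"street": "1 Main St", "city": "Boston"}}

# A later row where the schema has evolved: address gained a "zip" field.
new_row = {"id": 2, "address": {"street": "2 Elm St", "city": "Austin", "zip": "73301"}}

renamed_old = rename_city_by_hand(old_row)
renamed_new = rename_city_by_hand(new_row)  # "zip" is not listed, so it is lost
```

The rename works for the original schema, but the evolved row comes back without its zip field and nothing signals the loss.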

It would be far better if the various APIs that can now manipulate top-level columns were
extended to handle struct fields at arbitrary locations or, alternatively, if we introduced
new APIs for modifying any field in a dataframe, whether it is a top-level one or one nested
inside a struct.
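
To make the requested semantics concrete, here is a minimal sketch in plain Python, again modeling rows as nested dicts rather than using Spark. It is not the implementation from the gist referenced in this issue and not an existing Spark API; the helper names (update_field, drop_field, rename_field) and the dotted-path convention are assumptions made for the example. The point is that the same call works for a top-level column and for a field nested inside a struct, and that fields not named in the call are carried over untouched.

```python
def update_field(row, path, fn):
    """Return a copy of row with the field at dotted `path` replaced by fn(old)."""
    head, _, rest = path.partition(".")
    if head not in row:
        raise KeyError(head)
    new_value = update_field(row[head], rest, fn) if rest else fn(row[head])
    return {**row, head: new_value}


def drop_field(row, path):
    """Return a copy of row without the field at dotted `path`."""
    head, _, rest = path.partition(".")
    if head not in row:
        raise KeyError(head)
    if rest:
        return {**row, head: drop_field(row[head], rest)}
    return {k: v for k, v in row.items() if k != head}


def rename_field(row, path, new_name):
    """Return a copy of row with the field at dotted `path` renamed in place."""
    head, _, rest = path.partition(".")
    if head not in row:
        raise KeyError(head)
    if rest:
        return {**row, head: rename_field(row[head], rest, new_name)}
    return {(new_name if k == head else k): v for k, v in row.items()}


row = {"id": 2, "address": {"street": "2 Elm St", "city": " austin ", "zip": "73301"}}

# Fix a data quality issue in a nested field; "street" and "zip" are never listed.
fixed = update_field(row, "address.city", lambda s: s.strip().title())

# The same helper handles a top-level column.
bumped = update_field(row, "id", lambda i: i + 1)

# Adjust the schema: rename one nested field, then drop another.
renamed = rename_field(fixed, "address.city", "town")
dropped = drop_field(renamed, "address.zip")
```

Because each helper copies every sibling field it was not asked about, a field added to the struct later passes through unchanged.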

Purely for discussion purposes, here is the skeleton implementation of an update() implicit
that we've used to modify any existing field in a dataframe. (Note that it depends on various
other utilities and implicits that are not included.) https://gist.github.com/ssimeonov/f98dcfa03cd067157fa08aaa688b0f66



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

