spark-issues mailing list archives

From "Yin Huai (JIRA)" <>
Subject [jira] [Commented] (SPARK-2789) Apply names to RDD to becoming SchemaRDD
Date Fri, 01 Aug 2014 21:01:39 GMT


Yin Huai commented on SPARK-2789:

We need to be careful about handling reserved characters in this String representation. I think
we also want to support quoted identifiers (using backticks). I am attaching a related doc
from PostgreSQL ( 
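
To make the quoting concern concrete, here is a minimal sketch of reading one field name that may be wrapped in backticks. Everything in it is an assumption for illustration: `read_name` and `RESERVED` are hypothetical names, not Spark APIs, and treating a doubled backtick as an escaped literal backtick is just one possible convention that the issue does not specify.

```python
# Characters that end an unquoted name in the proposed string format.
# Assumption: the backtick itself is reserved as the quote character.
RESERVED = set(" ()[]{}`")


def read_name(text, pos=0):
    """Return (name, next_pos) for the field name starting at text[pos]."""
    if pos < len(text) and text[pos] == "`":
        chars = []
        i = pos + 1
        while i < len(text):
            if text[i] == "`":
                # Assumed escape rule: a doubled backtick is a literal backtick.
                if text[i + 1:i + 2] == "`":
                    chars.append("`")
                    i += 2
                    continue
                return "".join(chars), i + 1
            chars.append(text[i])
            i += 1
        raise ValueError("unterminated quoted identifier at offset %d" % pos)

    # Unquoted name: read until a reserved character or end of input.
    end = pos
    while end < len(text) and text[end] not in RESERVED:
        end += 1
    if end == pos:
        raise ValueError("expected a name at offset %d" % pos)
    return text[pos:end], end
```

With this, a name such as `zip code` can contain a space or a bracket when quoted, while unquoted names still stop at the first reserved character.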

> Apply names to RDD to becoming SchemaRDD
> ----------------------------------------
>                 Key: SPARK-2789
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Davies Liu
> In order to simplify applying a schema, we could add an API called applyNames(), which will
> infer the types in the RDD and create a schema with names, then apply this schema to it
> to turn it into a SchemaRDD. The names could be provided as a String with names separated by spaces.
> For example:
> rdd = sc.parallelize([("Alice", 10)])
> srdd = sqlCtx.applyNames(rdd, "name age")
> Users don't need to create a case class or StructType to have all the power of Spark SQL.
> The string representation of the schema could also support nested structures (MapType, ArrayType
> and StructType), for example:
> "name age address(city zip) likes[title stars] props{[value type]}"
> It is equivalent to the following schema (types omitted):
> root
> |--name
> |--age
> |--address
> |--|--city
> |--|--zip
> |--likes
> |--|--element
> |--|--|--title
> |--|--|--stars
> |--props
> |--|--key:
> |--|--value:
> |--|--|--element
> |--|--|--|--value
> |--|--|--|--type
> All field names are separated by spaces. The structure of a field (if it is a nested
> type) follows the name without a space and should start with "(" (StructType), "[" (ArrayType)
> or "{" (MapType).
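
The grammar described above can be sketched as a small recursive-descent parser. This is only an illustration under stated assumptions, not the proposed Spark implementation: `SchemaNameParser`, `parse_schema_names` and `render` are hypothetical names, types are left out entirely (only names and nesting are recovered), and the content of "[...]" or "{...}" is read either as another bracketed type or as the field list of a struct, which is one reading of the props{[value type]} example.

```python
OPEN_TO_CLOSE = {"(": ")", "[": "]", "{": "}"}


class SchemaNameParser(object):
    """Hypothetical parser for strings like 'name address(city zip)'."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def skip_spaces(self):
        while self.peek() == " ":
            self.pos += 1

    def parse_name(self):
        start = self.pos
        while self.peek() and self.peek() not in " ()[]{}":
            self.pos += 1
        if start == self.pos:
            raise ValueError("expected a field name at offset %d" % start)
        return self.text[start:self.pos]

    def parse_fields(self, closer):
        """Parse space-separated fields up to `closer` ('' means end of input)."""
        fields = []
        self.skip_spaces()
        while self.peek() != closer:
            name = self.parse_name()
            # A nested type, if any, follows the name with no space.
            ftype = self.parse_type() if self.peek() in OPEN_TO_CLOSE else None
            fields.append((name, ftype))
            self.skip_spaces()
        return fields

    def parse_type(self):
        opener = self.peek()
        closer = OPEN_TO_CLOSE[opener]
        self.pos += 1
        if opener == "(":
            node = ("struct", self.parse_fields(closer))
        else:
            kind = "array" if opener == "[" else "map"
            self.skip_spaces()
            if self.peek() in OPEN_TO_CLOSE:
                inner = self.parse_type()  # e.g. props{[value type]}
                self.skip_spaces()
            else:
                fields = self.parse_fields(closer)
                inner = ("struct", fields) if fields else None
            node = (kind, inner)
        if self.peek() != closer:
            raise ValueError("expected %r at offset %d" % (closer, self.pos))
        self.pos += 1
        return node


def parse_schema_names(text):
    """Return ('struct', [(name, nested_type_or_None), ...]) for `text`."""
    return ("struct", SchemaNameParser(text).parse_fields(""))


def render(node, depth=1):
    """Render a parsed node as '|--' tree lines, one name per line."""
    if node is None:
        return []
    kind, inner = node
    prefix = "|--" * depth
    lines = []
    if kind == "struct":
        for name, ftype in inner:
            lines.append(prefix + name)
            lines.extend(render(ftype, depth + 1))
    elif kind == "array":
        lines.append(prefix + "element")
        lines.extend(render(inner, depth + 1))
    else:  # map
        lines.append(prefix + "key:")
        lines.append(prefix + "value:")
        lines.extend(render(inner, depth + 1))
    return lines
```

Calling render(parse_schema_names("name age address(city zip)")) yields one tree line per name, with nested fields indented one level deeper. An unbalanced string such as "a(b" raises ValueError in this sketch.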

This message was sent by Atlassian JIRA
