spark-issues mailing list archives

From "David Courtinot (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-23520) Add support for MapType fields in JSON schema inference
Date Tue, 27 Feb 2018 00:11:00 GMT
David Courtinot created SPARK-23520:
---------------------------------------

             Summary: Add support for MapType fields in JSON schema inference
                 Key: SPARK-23520
                 URL: https://issues.apache.org/jira/browse/SPARK-23520
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL
    Affects Versions: 2.2.1
            Reporter: David Courtinot


_InferSchema_ currently does not support inferring _MapType_ fields from JSON data, and for
a good reason: they are indistinguishable from structs in JSON format. In issue [SPARK-23494|https://issues.apache.org/jira/browse/SPARK-23494],
I proposed exposing some methods of _InferSchema_ so that users can build on top
of the inference primitives defined by this class. In this issue, I'm proposing to give users
more control by letting them specify a set of fields that should be forced to _MapType._

*Use-case*

Some JSON datasets contain high-cardinality fields, i.e. fields whose key space is very
large. These fields shouldn't be interpreted as _StructType_, for the following reasons (a short
example follows the list):
 * it's not really what they are. The key space as well as the value space may both be unbounded,
so what best describes the schema of this data is the type of the keys and the type of the values,
not a struct containing all possible key-value pairs.
 * interpreting high-cardinality fields as structs can lead to enormous schemata that don't
even fit into memory.
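
To make the use-case concrete, here is a small spark-shell illustration (assuming Spark 2.2; the field name _countsByUser_ and the sample records are made up):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("maptype-inference").getOrCreate()
import spark.implicits._

val jsonLines = Seq(
  """{"countsByUser": {"user_1": 3, "user_2": 7}}""",
  """{"countsByUser": {"user_3": 1, "user_4": 9}}"""
).toDS()

// Current behaviour: every distinct key becomes its own StructField, so the inferred
// schema grows with the number of distinct keys in the whole dataset.
spark.read.json(jsonLines).printSchema()
// root
//  |-- countsByUser: struct (nullable = true)
//  |    |-- user_1: long (nullable = true)
//  |    |-- user_2: long (nullable = true)
//  |    |-- user_3: long (nullable = true)
//  |    |-- user_4: long (nullable = true)

// Workaround today: skip inference entirely by hand-writing the full schema with a MapType.
// Per-field control over inference, as proposed here, would avoid having to spell out
// the rest of the schema by hand.
val forcedSchema = new StructType()
  .add("countsByUser", MapType(StringType, LongType))
spark.read.schema(forcedSchema).json(jsonLines).printSchema()
// root
//  |-- countsByUser: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: long (valueContainsNull = true)
{code}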

*Proposition*

We would add a public overloaded signature for _InferSchema.inferField_ that accepts a set
of field accessors (a class that can represent access to any JSON field, including nested ones)
for which we do not want to recurse, instead forcing a user-supplied schema. In particular,
that would allow asking for a few fields to be inferred as maps rather than structs.
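
For discussion purposes only, here is a rough sketch of the shape such an overload could take. _FieldAccessor_, _forcedTypes_ and _inferFieldSketch_ are hypothetical names used purely for illustration; they are not part of the existing _InferSchema_ API, and the real signature would have to follow the current private implementation.

{code:scala}
import org.apache.spark.sql.types._

// Hypothetical: a path identifying a (possibly nested) JSON field.
case class FieldAccessor(path: Seq[String])

object InferSchemaSketch {
  // Hypothetical overload: before recursing into a field's value, check whether the caller
  // forced a type for this path; if so, return it and skip recursion entirely.
  def inferFieldSketch(
      path: FieldAccessor,
      recurse: () => DataType,
      forcedTypes: Map[FieldAccessor, DataType]): DataType =
    forcedTypes.getOrElse(path, recurse())
}

// Usage: force a nested high-cardinality field to be inferred as map<string,long>.
val forced: Map[FieldAccessor, DataType] = Map(
  FieldAccessor(Seq("payload", "countsByUser")) -> MapType(StringType, LongType))

val fieldType = InferSchemaSketch.inferFieldSketch(
  FieldAccessor(Seq("payload", "countsByUser")),
  // In the real code this closure would be the existing recursive inference step.
  () => new StructType().add("user_1", LongType).add("user_2", LongType),
  forced)
// fieldType == MapType(StringType, LongType, true)
{code}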

I am very open to discussing this with people who are better versed in the Spark codebase
than I am, because I realize my proposition may feel somewhat patchy. I'll be more than happy
to put in some development effort if we manage to sketch a reasonably easy solution.



