spark-issues mailing list archives

From "David Courtinot (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-23494) Expose InferSchema's functionalities to the outside
Date Fri, 23 Feb 2018 15:27:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-23494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Courtinot updated SPARK-23494:
------------------------------------
    Description: 
I'm proposing that InferSchema's internals (inferring the schema of each record, merging two schemata,
and canonicalizing the result) be exposed to the outside.

*Use-case*

My team continuously produces large amounts of JSON data. The schema is, and must be, very dynamic:
fields can appear and disappear from one day to the next, most fields are nullable, some fields
occur only rarely, and so on.

In another job, we download this data, sample it, and infer the schema using Dataset.schema().
From there, we output the data in Parquet for later querying; a rough sketch of this workflow
follows the list below. This approach has proved problematic:
 * rare fields can be absent from a sample, and therefore absent from the schema. This results
in exceptions when trying to query those fields. We have had to implement cumbersome fixes
for this involving a manually curated set of required fields.
 * this is expensive. Going through a sample of the data just to infer the schema is still a very
costly operation for us. Caching the JSON RDD to disk (it doesn't fit in memory) turned out to
be at least as slow as traversing the sample first and then the whole dataset.
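
For concreteness, here is a rough sketch of that current workflow. Paths and the sampling
fraction are made up for illustration, and spark is a SparkSession:

{code:scala}
// Current approach: pay a pass over a sample just to obtain a schema,
// then a second pass to actually convert and write the data.
val raw = spark.read.textFile("s3://bucket/json/2018-02-23/")   // illustrative path

// Infer the schema from a sample only; rare fields may be missing from it.
val sampledSchema = spark.read.json(raw.sample(withReplacement = false, fraction = 0.01)).schema

// Re-read the full data with the sampled schema and write it out as Parquet.
spark.read.schema(sampledSchema).json(raw)
  .write.parquet("s3://bucket/parquet/2018-02-23/")             // illustrative path
{code}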

*Proposition*

InferSchema is essentially a fold operator. This means a Spark accumulator can easily be built
on top of it in order to compute a schema alongside an RDD computation (a minimal sketch follows
the list below). In the above use-case, this has three main advantages:
 * the schema is inferred from the entire dataset, and therefore contains all possible fields;
 * the computational overhead is negligible, since inference happens while the data is being
written to an external store, rather than through a separate evaluation of the RDD for the sole
purpose of schema inference;
 * after writing the manifest to an external store, we can load the JSON data into a Dataset
without ever paying the inference cost again (just the conversion from JSON to Row).
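
To make the fold structure concrete, below is a minimal sketch of such an accumulator. It
assumes the per-record inference and the schema-merge step were exposed under the hypothetical
names InferSchema.inferRecord and InferSchema.mergeSchemas; nothing with exactly these
signatures is public today, which is what this ticket asks for:

{code:scala}
import org.apache.spark.sql.types.DataType
import org.apache.spark.util.AccumulatorV2

// Minimal sketch. The two calls below are *hypothetical* names for the pieces this
// ticket asks to expose (today this logic is private to the JSON data source):
//   InferSchema.inferRecord(json: String): DataType               -- schema of one record
//   InferSchema.mergeSchemas(a: DataType, b: DataType): DataType  -- the fold step
class SchemaAccumulator extends AccumulatorV2[String, DataType] {
  private var current: Option[DataType] = None

  override def isZero: Boolean = current.isEmpty

  override def copy(): SchemaAccumulator = {
    val acc = new SchemaAccumulator
    acc.current = current
    acc
  }

  override def reset(): Unit = current = None

  // Fold one more record into the running schema.
  override def add(record: String): Unit = {
    val recordSchema = InferSchema.inferRecord(record)
    current = Some(current.fold(recordSchema)(InferSchema.mergeSchemas(_, recordSchema)))
  }

  // Merge the partial schemata computed on different executors.
  override def merge(other: AccumulatorV2[String, DataType]): Unit =
    Option(other.value).foreach { theirs =>
      current = Some(current.fold(theirs)(InferSchema.mergeSchemas(_, theirs)))
    }

  override def value: DataType = current.orNull
}
{code}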

With such a feature, users can treat their JSON (or any other) data as structured data whenever
they want, even though the actual schema may vary every ten minutes, as long as they record the
schema of each portion of the data.
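
Usage would then look roughly like the following: piggyback the accumulator on the job that
writes the data, persist the resulting schema as the manifest, and reuse it when reading.
jsonRdd stands for an RDD[String] of JSON lines; paths are again illustrative:

{code:scala}
import org.apache.spark.sql.types.{DataType, StructType}

// While writing the data, fold the schema in with the accumulator sketched above.
// (Accumulator updates inside a transformation can be re-applied on task retries,
// which is harmless here: merging a schema with itself changes nothing.)
val schemaAcc = new SchemaAccumulator
spark.sparkContext.register(schemaAcc, "inferredSchema")

jsonRdd.map { line => schemaAcc.add(line); line }
  .saveAsTextFile("s3://bucket/json/2018-02-23/")        // illustrative path

// Persist schemaAcc.value.json as a manifest next to the data, then later...
val manifestJson: String = ???  // contents of that manifest
val schema = DataType.fromJson(manifestJson).asInstanceOf[StructType]

// ...load the data without ever re-inferring the schema.
val df = spark.read.schema(schema).json("s3://bucket/json/2018-02-23/")
{code}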

> Expose InferSchema's functionalities to the outside
> ---------------------------------------------------
>
>                 Key: SPARK-23494
>                 URL: https://issues.apache.org/jira/browse/SPARK-23494
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 2.2.1
>            Reporter: David Courtinot
>            Priority: Major
>



