spark-dev mailing list archives

From Michael Armbrust <>
Subject Re: Spark SQL JSON Column Support
Date Thu, 29 Sep 2016 19:13:09 GMT
> Will this be able to handle projection pushdown if a given job doesn't
> utilize all the columns in the schema?  Or should people have a
> per-job schema?

As currently written, we will do a little bit of extra work to pull out
fields that aren't needed.  I think it would be pretty straightforward to
add a rule to the optimizer that prunes the schema passed to the
JsonToStruct expression when there is another Project operator present.
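The pruning rule described above can be illustrated with a toy sketch (plain Python, not Spark's optimizer; `prune_schema` and the dict-based schema are hypothetical stand-ins): given the full schema and the set of columns a Project actually references, keep only the referenced fields so the JSON parser can skip the rest.

```python
def prune_schema(full_schema, referenced):
    # Hypothetical illustration of schema pruning: keep only the
    # fields that a downstream projection actually references, so
    # parsing can skip everything else.
    return {name: dtype for name, dtype in full_schema.items()
            if name in referenced}

full = {"id": "long", "name": "string", "payload": "string"}
pruned = prune_schema(full, {"id", "name"})
# "payload" is dropped from the schema handed to the parser.
```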

> I’m not a spark guru, but I would have hoped that DataSets and DataFrames
> were more dynamic.

We are dynamic in that all of these decisions can be made at runtime, and
you can even look at the data when making them.  We do, however, need to
know the schema before any single query begins executing, so that we can
give good analysis error messages and so that we can generate efficient
bytecode in our code generation.
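A toy sketch of why the schema must be known up front (plain Python; `analyze` is a hypothetical stand-in for an analysis phase, not Spark's): with a fixed schema, a reference to a missing column can be rejected with a clear error before any data is read.

```python
def analyze(projected_columns, schema):
    # Hypothetical analysis step: because the schema is fixed before
    # execution, an unresolvable column reference fails fast here,
    # rather than mid-job while scanning data.
    missing = [c for c in projected_columns if c not in schema]
    if missing:
        raise ValueError(f"cannot resolve column(s): {missing}")

schema = {"id": "long", "name": "string"}
analyze(["id", "name"], schema)   # resolves fine
# analyze(["idd"], schema)        # would raise before execution starts
```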

> You should be doing schema inference. JSON includes the schema with each
> record and you should take advantage of it. I guess the only issue is
> that DataSets / DataFrames have static schemas and structures. Then if your
> first record doesn’t include all of the columns you will have a problem.

I agree that for ad-hoc use cases we should make it easy to infer the
schema.  I would also argue that for a production pipeline you need the
ability to specify it manually to avoid surprises.

There are several tricky cases here.  You bring up the fact that the first
record might be missing fields, but in many data sets there are fields that
are only present in 1 out of 100,000 records.  Even if all fields are
present, sometimes it can be very expensive to get even the first record
(say you are reading from an expensive query coming from the JDBC data
source).

Another issue is that inference means you need to read some data before
the user explicitly starts the query.  Historically, cases where we do this
have been pretty confusing to users of Spark (think: the surprise job that
finds partition boundaries for RDD.sort).
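The sparse-field problem above can be shown with a toy inference pass (plain Python, not Spark's actual inference; `infer_schema` is a hypothetical helper): inferring from only the first record misses fields that appear later, while scanning a sample merges fields across records.

```python
import json

def infer_schema(records):
    # Hypothetical sketch: merge field names (and Python type names)
    # across all sampled records, so a field absent from the first
    # record is still discovered if it appears later in the sample.
    schema = {}
    for rec in records:
        for key, value in json.loads(rec).items():
            schema.setdefault(key, type(value).__name__)
    return schema

sample = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "name": "b", "rare_field": true}',  # rarely-present field
]
first_only = infer_schema(sample[:1])  # misses "rare_field"
merged = infer_schema(sample)          # discovers it
```

This also shows why inference forces a read before the user's query runs: you cannot build the merged schema without scanning some records first.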

So, I think we should add inference, but that it should be in addition to
the API proposed in this PR.
