spark-dev mailing list archives

From Michael Allman <>
Subject Re: [Spark SQL] Making InferSchema and JacksonParser public
Date Wed, 18 Jan 2017 17:21:19 GMT
Personally, I'd love to see some kind of pluggability or configurability in the JSON schema parsing,
maybe as an option in the DataFrameReader. Perhaps you can propose an API?

> On Jan 18, 2017, at 5:51 AM, Brian Hong <> wrote:
> I work for a mobile game company. I'm solving a simple question: "Can we efficiently/cheaply
> query for the log of a particular user within a given date period?"
> I've created a special JSON text-based file format that has these traits:
>  - Snappy compressed, saved in AWS S3
>  - Partitioned by date, i.e. <>, <>, ...
>  - Sorted by a primary key (log_type) and a secondary key (user_id), then Snappy-compressed
>    in 5 MB blocks
>  - Blocks are indexed by primary/secondary key in file 2017-01-01.json
>  - Efficient block-based random access on primary key (log_type) and secondary key (user_id)
>    using the index
> I've created a Spark SQL DataFrame relation that can query this file format.  Since the
> schema of each log type is fairly consistent, I've reused the `InferSchema.inferSchema` method
> and `JacksonParser` in the Spark SQL code to support structured querying.  I've also implemented
> filter push-down to optimize the file access.
> Compared to the Parquet file format, it is very fast when querying for a single user, or
> for a single log type with a sampling ratio of 10000 to 1.  (We do use Parquet for some
> log types when we need batch analysis.)
> One problem we face is that the methods mentioned above are private APIs, so we are
> forced to resort to hacks to use them, such as copying the code or placing our classes in
> the org.apache.spark.sql package namespace.
> I've been following the Spark SQL code since 1.4, and the JSON schema inference code and
> JacksonParser seem to have been relatively stable recently.  Can the core devs make these
> APIs public?
> We are willing to open source this file format because it works very well for archiving
> user-related logs in S3.  The dependency on private APIs in Spark SQL is the main hurdle
> to making this a reality.
> Thank you for reading!
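
For readers following the thread, the block index and filter push-down described in the quoted message can be illustrated with a small sketch. This is not Brian's implementation and uses no Spark APIs; the `BlockEntry` layout, the function names, and the tuple form of the filters are all assumptions made for illustration. It only shows how a sorted index on (log_type, user_id) lets a reader pick the few blocks worth fetching and decompressing.

```python
from bisect import bisect_left, bisect_right
from typing import NamedTuple


class BlockEntry(NamedTuple):
    """Hypothetical index entry: the key of the first record in a block."""
    log_type: str   # primary key of the block's first record
    user_id: str    # secondary key of the block's first record
    offset: int     # byte offset of the compressed block in the data file
    length: int     # compressed length in bytes


def candidate_blocks(index, log_type, user_id=None):
    """Return the index entries whose blocks may contain the requested key.

    `index` must be sorted by (log_type, user_id), like the data file itself.
    With user_id=None, all blocks that may hold the given log_type are returned.
    """
    if user_id is None:
        keys = [(e.log_type,) for e in index]
        target = (log_type,)
    else:
        keys = [(e.log_type, e.user_id) for e in index]
        target = (log_type, user_id)
    lo = bisect_left(keys, target)
    hi = bisect_right(keys, target)
    # The block just before `lo` starts with a smaller key but may still
    # end with records for the target key, so include it conservatively.
    return index[max(lo - 1, 0):hi]


def push_down(filters):
    """Split (column, op, value) filters into index keys and residual filters.

    Only equality on log_type / user_id is pushed to the index (an assumption);
    everything else must be re-checked on the decoded rows.
    """
    keys, residual = {}, []
    for column, op, value in filters:
        if op == "=" and column in ("log_type", "user_id"):
            keys[column] = value
        else:
            residual.append((column, op, value))
    return keys.get("log_type"), keys.get("user_id"), residual


index = [
    BlockEntry("login", "u001", 0, 100),
    BlockEntry("login", "u500", 100, 120),
    BlockEntry("purchase", "u010", 220, 90),
    BlockEntry("purchase", "u700", 310, 80),
]

log_type, user_id, residual = push_down(
    [("log_type", "=", "login"), ("user_id", "=", "u600"), ("level", ">", 3)]
)
blocks = candidate_blocks(index, log_type, user_id)
```

In this sketch, a query for one user touches a single block, while a query for a whole log type touches every block that may hold that type; a real reader would then issue ranged reads against S3 for the selected offsets.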
