asterixdb-dev mailing list archives

From Wail Alkowaileet <wael....@gmail.com>
Subject Asterix Schema Provider Framework
Date Fri, 25 Dec 2015 23:20:30 GMT
Dear Devs,

First of all, Happy Holidays :)

I want to share my latest work on AsterixDB: the Asterix Schema
Provider Framework.
I will share the design document once I have fully integrated it with the
new Asterix Messaging Framework.

*Summary:*
The main aim of the Schema Provider Framework is to help the user
understand the schema of the query result.

*Motivation:*
I'm currently working on building an AsterixDB-Spark connector. Spark works
well with JSON, but it has to scan the whole result to infer the
schema. To spare Spark this extra pass, Asterix can infer the schema
while materializing the result.

Additionally, Asterix users can get the schema information in a
Thrift/ADM-like format, which can help them build the classes required to
deserialize the result in their code.

*Brief description of how it works:*
Once the user asks for the schema to be inferred, the schema builder
follows the result printer (APrinterVisitor) to collect type information
about the records, lists, and fields. It then computes the final
schema (the union) of the resulting output in a single pass.
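
To make the single-pass union idea concrete, here is a minimal sketch in
Python. It is not the framework's code and does not use APrinterVisitor;
the names infer, union, infer_schema and the dictionary layout of a schema
node are assumptions made only for illustration. Each result value is
turned into a schema node, and the node is merged into a running schema, so
the input is read exactly once.

```python
from typing import Any, Iterable, Optional

# Hypothetical marker for a field that is absent from some records.
MISSING = {"kind": "missing"}


def infer(value: Any) -> dict:
    """Describe one JSON-like value as a schema node (illustrative layout)."""
    if isinstance(value, dict):
        return {"kind": "record",
                "fields": {name: infer(v) for name, v in value.items()}}
    if isinstance(value, list):
        item = None  # None means "no element seen yet"
        for element in value:
            item = union(item, infer(element))
        return {"kind": "list", "item": item}
    if value is None:
        return {"kind": "null"}
    # Primitive values: use the Python type name, e.g. "int" or "str".
    return {"kind": type(value).__name__}


def union(a: Optional[dict], b: Optional[dict]) -> Optional[dict]:
    """Merge two schema nodes into one node that accepts both."""
    if a is None:
        return b
    if b is None:
        return a
    if a == b:
        return a
    if a["kind"] == b["kind"] == "record":
        # Keep field order: fields of `a` first, then new fields of `b`.
        names = list(a["fields"]) + [n for n in b["fields"]
                                     if n not in a["fields"]]
        fields = {n: union(a["fields"].get(n, MISSING),
                           b["fields"].get(n, MISSING)) for n in names}
        return {"kind": "record", "fields": fields}
    if a["kind"] == b["kind"] == "list":
        return {"kind": "list", "item": union(a["item"], b["item"])}
    # Different kinds: build a flat union without duplicate options.
    options = []
    for node in (a, b):
        parts = node["options"] if node["kind"] == "union" else [node]
        for option in parts:
            if option not in options:
                options.append(option)
    return {"kind": "union", "options": options}


def infer_schema(values: Iterable[Any]) -> Optional[dict]:
    """Fold every result value into one schema in a single pass."""
    schema = None
    for value in values:
        schema = union(schema, infer(value))
    return schema
```

For example, infer_schema([{"id": 1}, {"id": "x", "name": "n"}]) yields a
record whose id field is a union of int and str and whose name field is a
union of missing and str. One simplification in this sketch: a record that
meets an existing union is added as a separate option instead of being
merged with a record already inside that union.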

*User-model:*
To see the tentative user model, please check the doc:
https://github.com/Nullification/incubator-asterixdb/blob/master/asterix-doc/src/site/markdown/api.md

Also see the attached images for screenshots of the web GUI,
including the resulting schema.


*Future "Ambitious" Applications:*
One low-hanging-fruit application is to extend Asterix's open/closed types
with a third kind of type called "inferred".
An inferred type asks Asterix to build the schema information on
ingestion. Inferred types can be very helpful, at least when you have a
schema that looks like one of our datasets (see attached wosType.adm), where
multiple fields can have similar names but different "schemas" or
nested types.

An inferred type is a hybrid (closed and open) that can offer the
flexibility of the *open type* with performance and a storage footprint
close to those of the *closed type*.

Inferred types are probably a good fit for read-intensive applications. For
write-intensive workloads, where every CPU cycle counts, they can introduce
some unnecessary overhead. But there is probably a clever solution using
adaptive sampling techniques.

I'll investigate this further and share my thoughts later on :-))

Have a wonderful holiday and happy weekend!
-- 

*Regards,*
Wail Alkowaileet
