Mailing-List: contact dev-help@asterixdb.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@asterixdb.incubator.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CANQf6DqB_RL4tB8wtw4Yhyg=hEACjNv42ZyVbb22PcLjwHzHSQ@mail.gmail.com>
References: 
 <CALgBV_e4PH7RjQmaqS1kQnOKUyYUPkVFJp_EUek3TEpbEXQORw@mail.gmail.com>
	<CANQf6DqB_RL4tB8wtw4Yhyg=hEACjNv42ZyVbb22PcLjwHzHSQ@mail.gmail.com>
Date: Thu, 31 Dec 2015 09:26:54 +0300
Message-ID: 
 <CALgBV_eUXQOiTrGJy7=BsSh2B00Cks7y39mmEfP8Yp1DsZerzg@mail.gmail.com>
Subject: Re: Asterix Schema Provider Framework
From: Wail Alkowaileet <wael.y.k@gmail.com>
To: dev@asterixdb.incubator.apache.org
Content-Type: multipart/alternative; boundary=001a11c3c5c2b3e21b05282bbd0a

--001a11c3c5c2b3e21b05282bbd0a
Content-Type: text/plain; charset=UTF-8

Hi Chen,

The schema inferencer API currently works on the printer side (i.e., it's
for the result output). Therefore, the schema is computed per partition, and
when the user asks for the schema, the schemas of all partitions get
"unioned" according to a policy defined by the implementation of the
schema inferencer API.

The inferencer works per item type. Therefore, for a mix of open and closed
types, the mix doesn't matter as long as the data is homogeneous (i.e., there
are *no* two items at the same nesting level with different types), as the
resulting schema will be the union, with nullables for the missing fields.
However, for heterogeneous types, it's again up to the API implementation. In
the Spark world, heterogeneous types are considered strings, and it's up to
the user to parse that string. In Asterix's case, we might take a different
approach by utilizing the current built-in union type.
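
To make the two policies concrete, here is a rough Python sketch of the
per-partition "union" step. It is only an illustration under assumptions:
the function names (record_schema, merge, resolve), the flat (non-nested)
records, the type names, and the "?" suffix for nullable fields are all
hypothetical and are not the actual inferencer API. Partitions are assumed
to be non-empty.

```python
from functools import reduce

NULL = "null"

def type_name(value):
    """Map a Python value to an illustrative (assumed) type name."""
    if value is None:
        return NULL
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value: {value!r}")

def record_schema(record):
    """Schema of one flat record: field name -> set of observed type names."""
    return {name: frozenset([type_name(v)]) for name, v in record.items()}

def merge(left, right):
    """Union two schemas; a field missing on either side becomes nullable."""
    missing = frozenset([NULL])
    return {
        name: left.get(name, missing) | right.get(name, missing)
        for name in left.keys() | right.keys()
    }

def partition_schema(records):
    """Single pass over one (non-empty) partition."""
    return reduce(merge, map(record_schema, records))

def resolve(schema, policy="union"):
    """Collapse each field's type set according to the chosen policy."""
    resolved = {}
    for name, types in sorted(schema.items()):
        concrete = sorted(types - {NULL})
        if not concrete:
            resolved[name] = NULL
            continue
        if len(concrete) == 1:
            base = concrete[0]
        elif policy == "string":  # Spark-like: heterogeneous -> string
            base = "string"
        else:  # keep every observed type as a union
            base = "union(" + ", ".join(concrete) + ")"
        resolved[name] = base + ("?" if NULL in types else "")
    return resolved

p1 = [{"id": 1, "name": "a"}, {"id": 2}]
p2 = [{"id": "x3", "name": "b", "score": 1.5}]
final = merge(partition_schema(p1), partition_schema(p2))
print(resolve(final, "union"))
# {'id': 'union(int64, string)', 'name': 'string?', 'score': 'double?'}
print(resolve(final, "string"))
# {'id': 'string', 'name': 'string?', 'score': 'double?'}
```

Because merge only takes set unions, the order in which partition schemas
are combined should not change the result in this sketch.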

For the "inferred" type, I imagine having some sort of versioning approach
as described in [1], and building a secondary index on "version_id" instead of
storing the ids in the property-node. That's why I actually asked about the
histograms, which can play a big role in determining the expected
schema for a query at compile time, instead of having the execution engine
inspect every type. It's like a JIT compiler for AQL.

I know it sounds "ugly", as it probably requires index and metadata lookups
for every insert. But the whole idea is undercooked and needs more
elaboration before we have a good picture of whether it would be beneficial.

[1]
http://btw-2015.de/res/proceedings/Hauptband/Wiss/Klettke-Schema_Extraction_and_Stru.pdf

Thanks and Happy New Year :-)


On Wed, Dec 30, 2015 at 10:05 PM, Chen Li <chenli@gmail.com> wrote:

> Sounds very interesting.  A basic question about "inference."  Is the
> inferred schema unique?  In other words, is it possible to get two
> schemas from the same instance, especially considering open types and
> closed types?
>
> Chen
>
> On Fri, Dec 25, 2015 at 3:20 PM, Wail Alkowaileet <wael.y.k@gmail.com>
> wrote:
> > Dear Devs,
> >
> > First of all, Happy Holidays :)
> >
> > I want to share with you my latest work on AsterixDB, Asterix Schema
> > Provider Framework.
> > The design document will be shared soon once I fully integrate it with the
> > new Asterix Messaging Framework.
> >
> > Summary:
> > The main aim of the Schema Provider Framework is to help the user to
> > understand the schema of the query result.
> >
> > Motivation:
> > I'm currently working on building an AsterixDB-Spark connector. Spark works
> > with JSON perfectly; however, it has to scan the whole result to infer the
> > schema. To prevent Spark from doing this pass, Asterix can infer the schema
> > while materializing the result.
> >
> > Additionally, Asterix users can get the schema information in a
> > Thrift/ADM-like format, which can help them build the required classes to
> > deserialize the result in their code.
> >
> > Brief description of how it works:
> > Once the user asks for the schema to be inferred, the schema builder will
> > follow the result printer (APrinterVisitor) to build up the information
> > about the record, list, and field types. Then it will compute the final
> > schema (union) of the resulting output in a single pass.
> >
> > User-model:
> > To see the tentative user model, please check the doc:
> >
> https://github.com/Nullification/incubator-asterixdb/blob/master/asterix-doc/src/site/markdown/api.md
> >
> > Also see the attached images for screenshots of the web-gui interface
> > including the resulting schema.
> >
> >
> > Future "Ambitious" Applications:
> > One low-hanging-fruit application is to extend Asterix open/closed types to
> > include yet another type called "inferred".
> > Inferred types will ask Asterix to build the schema information on
> > ingestion. Inferred types can be very helpful, at least when you have a
> > schema that looks like one of our datasets (see attached wosType.adm), where
> > you can have multiple fields with similar names and different "schemas" or
> > nested types.
> >
> > The inferred type is a hybrid type (closed and open) that can have the
> > flexibility of the open type, with performance and a storage footprint
> > close to those of the closed type.
> >
> > The inferred type is probably good for read-intensive applications. For
> > write-intensive ones, where every CPU cycle counts, this can introduce some
> > unnecessary overhead. But there is probably a clever solution with some
> > adaptive sampling techniques.
> >
> > I'll be investigating more about this and share my thoughts later on :-))
> >
> > Have a wonderful holiday and happy weekend!
> > --
> >
> > Regards,
> > Wail Alkowaileet
>


-- 

*Regards,*
Wail Alkowaileet

--001a11c3c5c2b3e21b05282bbd0a--