Date: Thu, 31 Dec 2015 09:26:54 +0300
Subject: Re: Asterix Schema Provider Framework
From: Wail Alkowaileet
To: dev@asterixdb.incubator.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a11c3c5c2b3e21b05282bbd0a

--001a11c3c5c2b3e21b05282bbd0a
Content-Type: text/plain; charset=UTF-8

Hi Chen,

The schema inferencer API currently works on the printer side (i.e., it
operates on the result output). The schema is therefore computed per
partition, and when the user asks for the schema, the schemas of all
partitions get "unioned" according to a policy defined by the implementation
of the schema inferencer API.

The inferencer works per item type. A mix of open and closed types therefore
doesn't matter as long as the data is homogeneous (i.e., there are *no* two
items at the same nesting level having different types): the resulting schema
will be the union, with nullables for the missing fields. For heterogeneous
types, however, it's again up to the API implementation.
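
As a rough illustration of the partition-level union described above, here is a minimal Python sketch. The names `merge_schemas` and `union_type_policy`, the `(type_name, nullable)` representation, and the restriction to flat (non-nested) schemas are all assumptions made for the example; none of them come from the actual schema inferencer API.

```python
from typing import Callable, Dict, Tuple

# Assumed representation: a field is (type_name, nullable), and a
# partition schema maps field name -> field. Nesting is left out.
Field = Tuple[str, bool]
Schema = Dict[str, Field]


def union_type_policy(left: str, right: str) -> str:
    """Hypothetical conflict policy: keep both alternatives as a union."""
    return "union(" + ", ".join(sorted({left, right})) + ")"


def merge_schemas(
    a: Schema,
    b: Schema,
    on_conflict: Callable[[str, str], str] = union_type_policy,
) -> Schema:
    """Union two partition schemas.

    A field missing from either side becomes nullable. A field whose type
    differs between the two sides is handed to the pluggable policy.
    """
    merged: Schema = {}
    for name in sorted(a.keys() | b.keys()):
        if name in a and name in b:
            (type_a, null_a), (type_b, null_b) = a[name], b[name]
            merged_type = type_a if type_a == type_b else on_conflict(type_a, type_b)
            merged[name] = (merged_type, null_a or null_b)
        else:
            # Present in only one partition: keep its type, mark it nullable.
            type_name, _ = a[name] if name in a else b[name]
            merged[name] = (type_name, True)
    return merged


partition_1 = {"id": ("int64", False), "name": ("string", False)}
partition_2 = {"id": ("int64", False), "age": ("int32", False)}
print(merge_schemas(partition_1, partition_2))
```

Passing a different `on_conflict` function is how an implementation-specific policy for heterogeneous fields could be swapped in under these assumptions.
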

In the Spark world, heterogeneous types are treated as strings, and it's up
to the user to parse that string. In the Asterix case, we might take a
different approach by utilizing the current built-in union type.

For the "inferred" type, I imagine some sort of versioning approach as
described in [1], building a secondary index on "version_id" instead of
storing the ids in the property-node. That's why I actually asked about the
histograms, which can play a big role in determining the expected schema for
a query at compile time, instead of having the execution engine inspect
every type. It would be like a JIT compiler for AQL.

I know it sounds "ugly", as it probably requires index and metadata lookups
for every insert. But the whole idea is undercooked and needs more
elaboration to get a good picture of whether it would be beneficial.

[1] http://btw-2015.de/res/proceedings/Hauptband/Wiss/Klettke-Schema_Extraction_and_Stru.pdf

Thanks and Happy New Year :-)

On Wed, Dec 30, 2015 at 10:05 PM, Chen Li wrote:

> Sounds very interesting. A basic question about "inference." Is the
> inferred schema unique? In other words, is it possible to get two
> schemas from the same instance, especially considering open types and
> closed types?
>
> Chen
>
> On Fri, Dec 25, 2015 at 3:20 PM, Wail Alkowaileet wrote:
> > Dear Devs,
> >
> > First of all, Happy Holidays :)
> >
> > I want to share with you my latest work on AsterixDB, the Asterix Schema
> > Provider Framework.
> > The design document will be shared soon, once I fully integrate it with
> > the new Asterix Messaging Framework.
> >
> > Summary:
> > The main aim of the Schema Provider Framework is to help the user
> > understand the schema of the query result.
> >
> > Motivation:
> > I'm currently working on building an AsterixDB-Spark connector. Spark
> > works perfectly with JSON; however, it has to scan the whole result to
> > infer the schema.
> > To prevent Spark from doing this pass, Asterix can infer the schema
> > while materializing the result.
> >
> > Additionally, Asterix users can get the schema information in a
> > Thrift/ADM-like format, which can help them build the classes required
> > to deserialize the result in their code.
> >
> > Brief description of how it works:
> > Once the user asks for the schema to be inferred, the schema builder will
> > follow the result printer (APrinterVisitor) to build up the information
> > about the record, list, and field types. It will then compute the final
> > schema (union) of the resulting output in a single pass.
> >
> > User-model:
> > To see the "tentative" user model, please check the doc:
> > https://github.com/Nullification/incubator-asterixdb/blob/master/asterix-doc/src/site/markdown/api.md
> >
> > Also see the attached images for screenshots of the web GUI, including
> > the resulting schema.
> >
> > Future "Ambitious" Applications:
> > One low-hanging-fruit application is to extend the Asterix open/closed
> > types to include yet another type called "inferred".
> > Inferred types will ask Asterix to build the schema information on
> > ingestion. Inferred types can be very helpful, at least when you have a
> > schema that looks like one of our datasets (see attached wosType.adm),
> > where you can have multiple fields with similar names and different
> > "schemas" or nested types.
> >
> > The inferred type is a hybrid type (closed and open) that can have the
> > flexibility of the open type with performance and a storage footprint
> > close to those of the closed type.
> >
> > The inferred type is probably good for read-intensive applications. For
> > write-intensive ones, where every CPU cycle counts, it can introduce some
> > unnecessary overhead. But there is probably a clever solution using
> > adaptive sampling techniques.
> >
> > I'll be investigating this further and will share my thoughts later on :-))
> >
> > Have a wonderful holiday and happy weekend!
> > --
> >
> > Regards,
> > Wail Alkowaileet
>

--
*Regards,*
Wail Alkowaileet

--001a11c3c5c2b3e21b05282bbd0a--
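
The single-pass inference described in the quoted announcement (a schema builder that follows the printer over records, lists, and fields) could look roughly like the Python sketch below. Everything here is an assumption made for illustration: the `SchemaBuilder` class is not the AsterixDB implementation, Python dicts and lists stand in for ADM records and lists, Python type names stand in for ADM types, and field names are assumed not to contain ".".

```python
from collections import defaultdict
from typing import Any, Dict


class SchemaBuilder:
    """Hypothetical one-pass schema accumulator (not the AsterixDB class)."""

    def __init__(self) -> None:
        self.type_names = defaultdict(set)  # path -> type names seen there
        self.present = defaultdict(int)     # field path -> times it appeared
        self.records = defaultdict(int)     # path -> records seen at path

    def observe(self, value: Any, path: str = "") -> None:
        """Visit one value, the way a printer walks it while emitting output."""
        if isinstance(value, dict):
            self.type_names[path].add("record")
            self.records[path] += 1
            for name, child in value.items():
                child_path = f"{path}.{name}" if path else name
                self.present[child_path] += 1
                self.observe(child, child_path)
        elif isinstance(value, list):
            self.type_names[path].add("list")
            for item in value:
                self.observe(item, path + "[]")
        else:
            self.type_names[path].add(type(value).__name__)

    def schema(self) -> Dict[str, str]:
        """Summarize each path; '?' marks fields absent from some parents."""
        result = {}
        for path in sorted(self.type_names):
            if path == "":
                continue  # the top-level record itself
            names = sorted(self.type_names[path])
            text = names[0] if len(names) == 1 else "union(" + ", ".join(names) + ")"
            parent = path.rsplit(".", 1)[0] if "." in path else ""
            if path in self.present and self.present[path] < self.records[parent]:
                text += "?"
            result[path] = text
        return result


builder = SchemaBuilder()
builder.observe({"id": 1, "name": "a", "tags": ["x", "y"]})
builder.observe({"id": "2", "address": {"city": "Irvine"}})
for field_path, field_type in builder.schema().items():
    print(field_path, field_type)
```

The builder only keeps counters and sets per path, so the result never has to be scanned a second time. A field seen with more than one type is reported as a union here, which is just one of the possible policies for heterogeneous data.
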