Date: Wed, 30 Dec 2015 11:05:20 -0800
Subject: Re: Asterix Schema Provider Framework
From: Chen Li
To: dev@asterixdb.incubator.apache.org

Sounds very interesting. A basic question about "inference": is the inferred
schema unique? In other words, is it possible to get two schemas from the same
instance, especially considering open types and closed types?

Chen

On Fri, Dec 25, 2015 at 3:20 PM, Wail Alkowaileet wrote:
> Dear Devs,
>
> First of all, Happy Holidays :)
>
> I want to share with you my latest work on AsterixDB, the Asterix Schema
> Provider Framework.
> The design document will be shared soon, once I fully integrate it with the
> new Asterix Messaging Framework.
>
> Summary:
> The main aim of the Schema Provider Framework is to help the user
> understand the schema of the query result.
>
> Motivation:
> I'm currently working on building an AsterixDB-Spark connector. Spark works
> with JSON perfectly; however, it has to scan the whole result to infer the
> schema.
> To prevent Spark from doing this pass, Asterix can infer the schema
> while materializing the result.
>
> Additionally, Asterix users can get the schema information in a
> Thrift/ADM-like format, which can help them build the classes required to
> deserialize the result in their code.
>
> Brief description of how it works:
> Once the user asks for the schema to be inferred, the schema builder will
> follow the result printer (APrinterVisitor) to build up the information
> about the record, list and field types. Then it will compute the final
> schema (union) of the resulting output in a single pass.
>
> User-model:
> To see the "tentative" user-model, please check the doc:
> https://github.com/Nullification/incubator-asterixdb/blob/master/asterix-doc/src/site/markdown/api.md
>
> Also see the attached images for screenshots of the web-gui interface,
> including the resulting schema.
>
>
> Future "Ambitious" Applications:
> One low-hanging-fruit application is to extend Asterix open/closed to
> include yet another type called "inferred".
> Inferred types will ask Asterix to build the schema information on
> ingestion. Inferred types can be very helpful, at least when you have a
> schema that looks like one of our datasets (see attached wosType.adm), where
> you can have multiple fields with similar names and different "schemas" or
> nested types.
>
> The inferred type is a hybrid type (closed and open) which can have the
> flexibility of the open type with performance and storage footprint close
> to the closed type.
>
> The inferred type is probably good for read-intensive applications. For
> write-intensive ones, where every CPU cycle counts, it can introduce some
> unnecessary overhead. But there is probably a clever solution with some
> adaptive sampling techniques.
>
> I'll investigate this further and share my thoughts later on :-))
>
> Have a wonderful holiday and happy weekend!
> --
>
> Regards,
> Wail Alkowaileet
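
To make the "final schema (union) ... in a single pass" step and the uniqueness question concrete, here is a minimal sketch in Python. It is not the framework's code and does not use APrinterVisitor; the functions infer, union and infer_all, the type tags ("int64", "missing", and so on) and the dict-based schema layout are all assumptions made for illustration. Each value gets a schema that maps type tags to details, and schemas are merged tag by tag as records stream past.

```python
import json
from functools import reduce

# Hypothetical tag for a field that is absent from some records.
MISSING = {"missing": None}


def infer(value):
    """Schema of one JSON-like value: a dict mapping type tag -> detail."""
    if isinstance(value, dict):
        return {"record": {name: infer(v) for name, v in value.items()}}
    if isinstance(value, list):
        # An empty list yields an empty item schema (no types seen yet).
        return {"list": reduce(union, map(infer, value), {})}
    if value is None:
        return {"null": None}
    if isinstance(value, bool):  # test bool before int: bool subclasses int
        return {"boolean": None}
    if isinstance(value, int):
        return {"int64": None}
    if isinstance(value, float):
        return {"double": None}
    return {"string": None}


def union(a, b):
    """Merge two schemas tag by tag; neither input is modified."""
    out = {}
    for tag in set(a) | set(b):
        if tag == "record" and tag in a and tag in b:
            fa, fb = a[tag], b[tag]
            # A field seen on only one side also gets the "missing" tag.
            out[tag] = {
                name: union(fa.get(name, MISSING), fb.get(name, MISSING))
                for name in set(fa) | set(fb)
            }
        elif tag == "list" and tag in a and tag in b:
            out[tag] = union(a[tag], b[tag])
        else:
            out[tag] = a[tag] if tag in a else b[tag]
    return out


def infer_all(records):
    """Single pass: fold each record's schema into the running union."""
    return reduce(union, map(infer, records), {})


if __name__ == "__main__":
    sample = [{"id": 1, "name": "a"}, {"id": "x", "tags": ["t", 2]}]
    print(json.dumps(infer_all(sample), indent=2, sort_keys=True))
```

In this sketch the merge is commutative and associative, so the same set of records gives the same schema regardless of arrival order. Whether the actual framework has that property, in particular once open and closed types are involved, is not stated in this thread.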