Subject: Re: Asterix Schema Provider Framework
From: Wail Alkowaileet
To: dev@asterixdb.incubator.apache.org
Date: Wed, 13 Jan 2016 16:55:25 -0500

Hello Chen,

Sorry for the late reply; I was swamped preparing for a workshop here in Boston. I also wanted to prepare a comprehensive design document that covers all the details of the schema inferencer framework I built. Please refer to it at:
https://docs.google.com/document/d/1Ue-yAWoLChOJ8JlkbXWdDW0tSW9szzP76ePL4wQRmP0/edit#

To save you some time (the document is a bit long), here is a short summary. Let's assume we have the following input:

{name: {
    display_name: "Boxer, Laurence",
    first_name: "Laurence",
    full_name: "Boxer, Laurence",
    reprint: "Y",
    role: "author",
    wos_standard: "Boxer, L",
    last_name: "Boxer",
    seq_no: "1"
}}
{name: {
    display_name: "Adamek, Jiri",
    first_name: "Jiri",
    addr_no: "1",
    full_name: "Adamek, Jiri",
    reprint: "Y",
    role: "author",
    wos_standard: "Adamek, J",
    last_name: "Adamek",
    dais_id: "10121636",
    seq_no: "1"
}}

As the "tuples" are all of type record, the schema inferencer computes the schema as the union of all the records' fields. Fields that do not appear in every record (addr_no and dais_id here) are marked optional with "?".

*As ADM types:*

create type nameType1 as closed {
    display_name: string,
    first_name: string,
    addr_no: string?,
    full_name: string,
    reprint: string,
    role: string,
    wos_standard: string,
    last_name: string,
    dais_id: string?,
    seq_no: string
}

create type datasetType as closed {
    name: nameType1
}

However, consider heterogeneous types, as in the following example:

name: {
    display_name: "Boxer, Laurence",
    first_name: "Laurence",
    full_name: "Boxer, Laurence",
    reprint: "Y",
    role: "author",
    wos_standard: "Boxer, L",
    last_name: "Boxer",
    seq_no: "1"
}

name: [
    {
        display_name: "Adamek, Jiri",
        first_name: "Jiri",
        addr_no: "1",
        full_name: "Adamek, Jiri",
        reprint: "Y",
        role: "author",
        wos_standard: "Adamek, J",
        last_name: "Adamek",
        dais_id: "10121636",
        seq_no: "1"
    },
    {
        display_name: "Koubek, Vaclav",
        first_name: "Vaclav",
        addr_no: "2",
        full_name: "Koubek, Vaclav",
        role: "author",
        wos_standard: "Koubek, V",
        last_name: "Koubek",
        dais_id: "12279647",
        seq_no: "2"
    }
]

As you can see, the field "name" is sometimes a record and sometimes an ordered list. Apache Spark handles this by simply inferring name as a String. In the Asterix case, we can instead infer the type as a UNION of a record and a list of records.

*As ADM types:*

create type nameType1 as closed {
    display_name: string,
    first_name: string,
    full_name: string,
    reprint: string,
    role: string,
    wos_standard: string,
    last_name: string,
    seq_no: string
}

create type nameType2 as closed {
    display_name: string,
    first_name: string,
    addr_no: string,
    full_name: string,
    reprint: string,
    role: string,
    wos_standard: string,
    last_name: string,
    dais_id: string,
    seq_no: string
}

create type datasetType as closed {
    name: union(nameType1, [nameType2])
}
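
The two inference rules described above (union of record fields with optional markers, and a UNION type for heterogeneous values) can be sketched in a few lines of Python. This is only an illustrative sketch, not the framework's actual code, which is described in the design document. The names infer and merge, the dict-based schema representation, and the primitive type mapping are all assumptions made for this example. The sketch marks a field optional whenever at least one merged record lacks it.

```python
# Hypothetical sketch of schema inference over JSON-like values.
# None of these names come from AsterixDB; they are for illustration only.

PRIMITIVES = {str: "string", bool: "boolean", int: "int64",
              float: "double", type(None): "null"}


def infer(value):
    """Infer a schema (a plain dict) for a single value."""
    if isinstance(value, dict):
        fields = {k: infer(v) for k, v in value.items()}
        return {"kind": "record", "fields": fields, "optional": set()}
    if isinstance(value, list):
        item = None
        for element in value:
            item = merge(item, infer(element))
        return {"kind": "list", "item": item}
    return {"kind": PRIMITIVES.get(type(value), "any")}


def merge(a, b):
    """Merge two schemas; schemas of different kinds become a union."""
    if a is None:
        return b
    if b is None:
        return a
    options = _options(a)
    for schema in _options(b):
        options = _add(options, schema)
    if len(options) == 1:
        return options[0]
    return {"kind": "union", "options": options}


def _options(schema):
    """View any schema as a list of union alternatives."""
    return list(schema["options"]) if schema["kind"] == "union" else [schema]


def _add(options, schema):
    """Fold schema into the alternative of the same kind, or append it."""
    out, placed = [], False
    for option in options:
        if not placed and option["kind"] == schema["kind"]:
            out.append(_merge_same(option, schema))
            placed = True
        else:
            out.append(option)
    if not placed:
        out.append(schema)
    return out


def _merge_same(a, b):
    """Merge two schemas that share the same kind."""
    if a["kind"] == "record":
        fa, fb = a["fields"], b["fields"]
        names = list(fa) + [k for k in fb if k not in fa]
        fields = {k: merge(fa.get(k), fb.get(k)) for k in names}
        # A field missing from either side becomes optional ("?" in ADM).
        optional = a["optional"] | b["optional"] | (set(fa) ^ set(fb))
        return {"kind": "record", "fields": fields, "optional": optional}
    if a["kind"] == "list":
        return {"kind": "list", "item": merge(a["item"], b["item"])}
    return a  # identical primitive kinds


# Usage: fold the per-record schemas of a dataset into one schema.
records = [
    {"name": {"first_name": "Laurence", "seq_no": "1"}},
    {"name": {"first_name": "Jiri", "addr_no": "1", "seq_no": "1"}},
]
schema = None
for record in records:
    schema = merge(schema, infer(record))
print(schema["fields"]["name"]["optional"])
```

Merging one schema at a time keeps the sketch incremental: each new record only needs to be compared against the schema accumulated so far, not against all earlier records.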