asterixdb-dev mailing list archives

From "Till Westmann" <ti...@apache.org>
Subject Re: Asterix Schema Provider Framework
Date Wed, 13 Jan 2016 22:20:46 GMT
Hi Wail,

thanks for writing this up!

I took a brief look and everything looks great, but there’s one thing 
that surprised me a bit: the modifications in Algebricks. It seemed to 
me that all the actual data and schema management should happen in 
AsterixDB and that Algebricks doesn’t really need to know about this.
Is there a (clean) way to keep all of this in AsterixDB?
Or do you think that we need a (possibly more generic) extension point 
in Algebricks to support this feature?

Cheers,
Till

On 13 Jan 2016, at 14:04, Wail Alkowaileet wrote:

> Sorry I forgot to put a link to the code:
> https://github.com/Nullification/incubator-asterixdb
> https://github.com/Nullification/incubator-asterixdb-hyracks
>
> It currently lives on my GitHub; I will push it to Gerrit soon.
>
> Thanks.
>
> On Wed, Jan 13, 2016 at 4:55 PM, Wail Alkowaileet <wael.y.k@gmail.com>
> wrote:
>
>> Hello Chen,
>>
>> Sorry for the late reply, I was hammered preparing for a workshop
>> here in Boston.
>> Also, I wanted to prepare a comprehensive design document that
>> includes all the details about the schema inferencer framework I built.
>>
>> Please refer to it at:
>> https://docs.google.com/document/d/1Ue-yAWoLChOJ8JlkbXWdDW0tSW9szzP76ePL4wQRmP0/edit#
>>
>> So just for the sake of your time (the document is a bit long):
>> Let's assume we have the following input:
>>
>> {name: {
>> display_name: "Boxer, Laurence",
>> first_name: "Laurence",
>> full_name: "Boxer, Laurence",
>> reprint: "Y",
>> role: "author",
>> wos_standard: "Boxer, L",
>> last_name: "Boxer",
>> seq_no: "1"
>> }}
>>
>> {name:{
>> display_name: "Adamek, Jiri",
>> first_name: "Jiri",
>> addr_no: "1",
>> full_name: "Adamek, Jiri",
>> reprint: "Y",
>> role: "author",
>> wos_standard: "Adamek, J",
>> last_name: "Adamek",
>> dais_id: "10121636",
>> seq_no: "1"
>> }}
>>
>> As the "tuples" are all of type record, the schema inferencer will
>> compute the schema as the union of all the records' fields. Fields
>> that are missing from some records are marked optional ("?").
>>
>> *as an ADM:*
>> create type nameType1 as closed{
>>
>> display_name: string,
>> first_name:string,
>> addr_no:string?,
>> full_name: string,
>> reprint:string,
>> role:string,
>> wos_standard:string,
>> last_name:string,
>> dais_id:string?,
>> seq_no:string
>>
>> }
>>
>> create type datasetType as closed{
>>
>> name: nameType1
>>
>> }
>>
>> However, for heterogeneous types, as in the following example:
>>
>> name: {
>> display_name: "Boxer, Laurence",
>> first_name: "Laurence",
>> full_name: "Boxer, Laurence",
>> reprint: "Y",
>> role: "author",
>> wos_standard: "Boxer, L",
>> last_name: "Boxer",
>> seq_no: "1"
>> }
>>
>> name: [
>> {
>>     display_name: "Adamek, Jiri",
>>     first_name: "Jiri",
>>     addr_no: "1",
>>     full_name: "Adamek, Jiri",
>>     reprint: "Y",
>>     role: "author",
>>     wos_standard: "Adamek, J",
>>     last_name: "Adamek",
>>     dais_id: "10121636",
>>     seq_no: "1"
>> },
>> {
>>     display_name: "Koubek, Vaclav",
>>     first_name: "Vaclav",
>>     addr_no: "2",
>>     full_name: "Koubek, Vaclav",
>>     role: "author",
>>     wos_standard: "Koubek, V",
>>     last_name: "Koubek",
>>     dais_id: "12279647",
>>     seq_no: "2"
>> }
>> ]
>>
>> As you can see, the field "name" is sometimes a record and sometimes
>> an ordered list. In this case, Apache Spark simply infers name as a
>> String.
>>
>> In the Asterix case, we can infer this type as a UNION of a record
>> and a list of records.
>>
>> *as an ADM:*
>> create type nameType1 as closed{
>>
>> display_name: string,
>> first_name:string,
>> full_name: string,
>> reprint:string,
>> role:string,
>> wos_standard:string,
>> last_name:string,
>> seq_no:string
>>
>> }
>>
>> create type nameType2 as closed{
>>
>> display_name: string,
>> first_name:string,
>> addr_no:string,
>> full_name: string,
>> reprint:string?,
>> role:string,
>> wos_standard:string,
>> last_name:string,
>> dais_id:string,
>> seq_no:string
>>
>> }
>>
>> create type datasetType as closed{
>>
>> name: union(nameType1, [nameType2])
>>
>> }
>>
>>
>>
>
>
> -- 
>
> *Regards,*
> Wail Alkowaileet
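
The inference rules described in the quoted message (union of record
fields, optional marking for fields missing from some records, and a
UNION for a field that is sometimes a record and sometimes a list) can
be sketched in Python as below. This is only an illustration under
stated assumptions, not the code from the linked repositories: the
function names infer, merge and render, the dict-based schema
representation, and the mapping of Python values to type names are all
hypothetical.

```python
def kind(s):
    """Return 'record', 'list', 'union', or the primitive type name."""
    return s["kind"] if isinstance(s, dict) else s


def infer(value):
    """Infer a schema for one parsed value (hypothetical representation)."""
    if isinstance(value, dict):
        fields = {k: {"type": infer(v), "optional": False}
                  for k, v in value.items()}
        return {"kind": "record", "fields": fields}
    if isinstance(value, list):
        item = None
        for v in value:
            item = merge(item, infer(v))  # all items share one item type
        return {"kind": "list", "item": item}
    if isinstance(value, bool):  # must be checked before int
        return "boolean"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "null"


def merge(a, b):
    """Merge two schemas; None means 'nothing seen yet'."""
    if a is None:
        return b
    if b is None or a == b:
        return a
    ka, kb = kind(a), kind(b)
    if ka == kb == "record":
        names = list(a["fields"]) + [n for n in b["fields"]
                                     if n not in a["fields"]]
        fields = {}
        for n in names:
            fa, fb = a["fields"].get(n), b["fields"].get(n)
            if fa and fb:
                fields[n] = {"type": merge(fa["type"], fb["type"]),
                             "optional": fa["optional"] or fb["optional"]}
            else:
                # Present on one side only: the field becomes optional.
                fields[n] = {"type": (fa or fb)["type"], "optional": True}
        return {"kind": "record", "fields": fields}
    if ka == kb == "list":
        return {"kind": "list", "item": merge(a["item"], b["item"])}
    # Different kinds: build a flat union, merging members of equal kind.
    members = []
    for s in (a, b):
        for m in (s["members"] if kind(s) == "union" else [s]):
            for i, existing in enumerate(members):
                if kind(existing) == kind(m):
                    members[i] = merge(existing, m)
                    break
            else:
                members.append(m)
    return {"kind": "union", "members": members}


def render(s):
    """Render a schema in a compact, ADM-like notation."""
    if s is None:
        return "any"  # assumption: item type of an empty list is unknown
    k = kind(s)
    if k == "record":
        parts = [f"{n}: {render(f['type'])}{'?' if f['optional'] else ''}"
                 for n, f in s["fields"].items()]
        return "{" + ", ".join(parts) + "}"
    if k == "list":
        return "[" + render(s["item"]) + "]"
    if k == "union":
        return "union(" + ", ".join(render(m) for m in s["members"]) + ")"
    return s


# Shortened versions of the records from the quoted message.
r1 = {"name": {"first_name": "Laurence", "reprint": "Y"}}
r2 = {"name": {"first_name": "Jiri", "reprint": "Y", "addr_no": "1"}}
r3 = {"name": [{"first_name": "Jiri", "reprint": "Y"},
               {"first_name": "Vaclav"}]}

homogeneous = render(merge(infer(r1), infer(r2)))
heterogeneous = render(merge(infer(r1), infer(r3)))
print(homogeneous)
print(heterogeneous)
```

In this sketch, merging r1 and r2 marks addr_no as optional, while
merging r1 and r3 yields a union of a record type and a list-of-records
type for name, which mirrors the two cases in the quoted message.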
