asterixdb-dev mailing list archives

From "Till Westmann" <ti...@apache.org>
Subject Re: Asterix Schema Provider Framework
Date Wed, 13 Jan 2016 23:00:14 GMT
Hi Wail,

creating one more object per query doesn’t scare me too much - 
especially if it’s only done if the schema is actually requested :) 
Also, it doesn’t seem that the construction would be very expensive.
So I think that I’d prefer the alternate way.

What do other people think?

Cheers,
Till

On 13 Jan 2016, at 14:49, Wail Alkowaileet wrote:

> Hi Till,
>
> I'm glad you brought that up. I tried to think about a better approach
> where the whole thing lives in Asterix.
>
> The problem appears when I need to pass the information to the
> SchemaBuilder, which lives in the "custom" IPrinterFactory. AFAIK, there
> are only two paths: either do it the way I did (i.e.,
> JobGenHelper.mkPrinters() sets the SchemaID and the
> HeterogeneousTypeComputer), or, for every query, create a new
> AqlCleanJSONWithSchemaPrinterFactoryProvider that holds the information
> the SchemaBuilder needs and then prepares the IPrinterFactory with the
> necessary information. Both ways work.
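>
> (Just to sketch what that second path could look like: this is only a rough
> illustration, the interfaces below are simplified stand-ins for the real
> Algebricks/Asterix ones rather than their actual signatures, and any class
> not mentioned above, e.g. the printer factory itself, is named
> hypothetically.)
>
> // Simplified stand-ins so the sketch is self-contained; the real
> // IPrinterFactory / IPrinterFactoryProvider live in Algebricks.
> interface IPrinterFactory { }
> interface IPrinterFactoryProvider {
>     IPrinterFactory getPrinterFactory(Object type);
> }
> // Stand-in for the per-query schema state the printers need.
> class SchemaBuilder { }
>
> // Per-query provider: created once per query with the schema state,
> // so JobGenHelper.mkPrinters() and Algebricks stay untouched.
> class AqlCleanJSONWithSchemaPrinterFactoryProvider
>         implements IPrinterFactoryProvider {
>
>     private final SchemaBuilder schemaBuilder;
>
>     AqlCleanJSONWithSchemaPrinterFactoryProvider(SchemaBuilder schemaBuilder) {
>         this.schemaBuilder = schemaBuilder;
>     }
>
>     @Override
>     public IPrinterFactory getPrinterFactory(Object type) {
>         // The factory carries the SchemaBuilder, so no extra wiring
>         // (SchemaID, HeterogeneousTypeComputer) is needed downstream.
>         return new AqlCleanJSONWithSchemaPrinterFactory(schemaBuilder, type);
>     }
> }
>
> // Hypothetical printer factory that just keeps what it was given.
> class AqlCleanJSONWithSchemaPrinterFactory implements IPrinterFactory {
>     private final SchemaBuilder schemaBuilder;
>     private final Object type;
>
>     AqlCleanJSONWithSchemaPrinterFactory(SchemaBuilder schemaBuilder, Object type) {
>         this.schemaBuilder = schemaBuilder;
>         this.type = type;
>     }
> }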
>
> I chose the first one as I wanted to keep the same singleton pattern
> that all implementations of IPrinterFactoryProvider follow.
> So it's actually possible :-))
>
> If that seems better, I can re-do it that way.
>
> Thanks.
>
> On Wed, Jan 13, 2016 at 5:20 PM, Till Westmann <tillw@apache.org> 
> wrote:
>
>> Hi Wail,
>>
>> thanks for writing this up!
>>
>> I took a brief look and everything looks great, but there’s one
>> thing that
>> surprised me a bit: the modifications in Algebricks. It seemed to me 
>> that
>> all the actual data and schema management should happen in AsterixDB 
>> and
>> that Algebricks doesn’t really need to know about this.
>> Is there a (clean) way to keep all of this in AsterixDB?
>> Or do you think that we need a (possibly more generic) extension 
>> point in
>> Algebricks to support this feature?
>>
>> Cheers,
>> Till
>>
>>
>> On 13 Jan 2016, at 14:04, Wail Alkowaileet wrote:
>>
>>> Sorry I forgot to put a link to the code:
>>> https://github.com/Nullification/incubator-asterixdb
>>> https://github.com/Nullification/incubator-asterixdb-hyracks
>>>
>>> It currently lives in my GitHub; I will push it to Gerrit soon.
>>>
>>> Thanks.
>>>
>>> On Wed, Jan 13, 2016 at 4:55 PM, Wail Alkowaileet 
>>> <wael.y.k@gmail.com>
>>> wrote:
>>>
>>>> Hello Chen,
>>>>
>>>> Sorry for the late reply, I was hammered preparing for a workshop
>>>> here in Boston.
>>>> Also I wanted to prepare a comprehensive design document that includes
>>>> all the details about the schema inferencer framework I built.
>>>>
>>>> Please refer to it at:
>>>>
>>>> https://docs.google.com/document/d/1Ue-yAWoLChOJ8JlkbXWdDW0tSW9szzP76ePL4wQRmP0/edit#
>>>>
>>>> So, just to save you some time (the document is a bit long):
>>>> Let's assume we have the following input:
>>>>
>>>> {name: {
>>>>   display_name: "Boxer, Laurence",
>>>>   first_name: "Laurence",
>>>>   full_name: "Boxer, Laurence",
>>>>   reprint: "Y",
>>>>   role: "author",
>>>>   wos_standard: "Boxer, L",
>>>>   last_name: "Boxer",
>>>>   seq_no: "1"
>>>> }}
>>>>
>>>> {name: {
>>>>   display_name: "Adamek, Jiri",
>>>>   first_name: "Jiri",
>>>>   addr_no: "1",
>>>>   full_name: "Adamek, Jiri",
>>>>   reprint: "Y",
>>>>   role: "author",
>>>>   wos_standard: "Adamek, J",
>>>>   last_name: "Adamek",
>>>>   dais_id: "10121636",
>>>>   seq_no: "1"
>>>> }}
>>>>
>>>> As the "tuples" are all of type record, the schema inferencer will
>>>> compute the schema as the union of all the records' fields.
>>>>
>>>> *as an ADM:*
>>>>
>>>> create type nameType1 as closed {
>>>>   display_name: string,
>>>>   first_name: string,
>>>>   addr_no: string?,
>>>>   full_name: string,
>>>>   reprint: string,
>>>>   role: string,
>>>>   wos_standard: string,
>>>>   last_name: string,
>>>>   dais_id: string?,
>>>>   seq_no: string
>>>> }
>>>>
>>>> create type datasetType as closed {
>>>>   name: nameType1
>>>> }
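>>>>
>>>> (A minimal sketch of that field-union step, just to make the idea
>>>> concrete; the class and method names below are illustrative only, not
>>>> the actual inferencer code. A field present in both records keeps its
>>>> type, a field present in only one becomes optional:)
>>>>
>>>> import java.util.LinkedHashMap;
>>>> import java.util.Map;
>>>>
>>>> public class FieldUnion {
>>>>     // Merges the (field -> type) maps of two flat records.
>>>>     // Type conflicts are ignored here; the heterogeneous case follows.
>>>>     public static Map<String, String> merge(Map<String, String> a,
>>>>                                             Map<String, String> b) {
>>>>         Map<String, String> result = new LinkedHashMap<>();
>>>>         for (Map.Entry<String, String> e : a.entrySet()) {
>>>>             // fields missing from b become optional, e.g. addr_no: string?
>>>>             result.put(e.getKey(),
>>>>                     b.containsKey(e.getKey()) ? e.getValue() : e.getValue() + "?");
>>>>         }
>>>>         for (Map.Entry<String, String> e : b.entrySet()) {
>>>>             // fields only in b (e.g. dais_id) also become optional
>>>>             result.putIfAbsent(e.getKey(), e.getValue() + "?");
>>>>         }
>>>>         return result;
>>>>     }
>>>> }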
>>>>
>>>> However, for heterogeneous types, as in the following example:
>>>>
>>>> name: {
>>>>   display_name: "Boxer, Laurence",
>>>>   first_name: "Laurence",
>>>>   full_name: "Boxer, Laurence",
>>>>   reprint: "Y",
>>>>   role: "author",
>>>>   wos_standard: "Boxer, L",
>>>>   last_name: "Boxer",
>>>>   seq_no: "1"
>>>> }
>>>>
>>>> name: [
>>>>   {
>>>>     display_name: "Adamek, Jiri",
>>>>     first_name: "Jiri",
>>>>     addr_no: "1",
>>>>     full_name: "Adamek, Jiri",
>>>>     reprint: "Y",
>>>>     role: "author",
>>>>     wos_standard: "Adamek, J",
>>>>     last_name: "Adamek",
>>>>     dais_id: "10121636",
>>>>     seq_no: "1"
>>>>   },
>>>>   {
>>>>     display_name: "Koubek, Vaclav",
>>>>     first_name: "Vaclav",
>>>>     addr_no: "2",
>>>>     full_name: "Koubek, Vaclav",
>>>>     role: "author",
>>>>     wos_standard: "Koubek, V",
>>>>     last_name: "Koubek",
>>>>     dais_id: "12279647",
>>>>     seq_no: "2"
>>>>   }
>>>> ]
>>>>
>>>> As you can see, the field "name" is sometimes a record and sometimes
>>>> an ordered list. What Apache Spark does is infer "name" simply as a
>>>> String.
>>>>
>>>> In Asterix's case, we can infer this type as a UNION of both a record
>>>> and a list of records.
>>>>
>>>> *as an ADM:*
>>>> create type nameType1 as closed {
>>>>   display_name: string,
>>>>   first_name: string,
>>>>   full_name: string,
>>>>   reprint: string,
>>>>   role: string,
>>>>   wos_standard: string,
>>>>   last_name: string,
>>>>   seq_no: string
>>>> }
>>>>
>>>> create type nameType2 as closed {
>>>>   display_name: string,
>>>>   first_name: string,
>>>>   addr_no: string,
>>>>   full_name: string,
>>>>   reprint: string,
>>>>   role: string,
>>>>   wos_standard: string,
>>>>   last_name: string,
>>>>   dais_id: string,
>>>>   seq_no: string
>>>> }
>>>>
>>>> create type datasetType as closed {
>>>>   name: union(nameType1, [nameType2])
>>>> }
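>>>>
>>>> (And a rough sketch of how a field with conflicting types could be
>>>> tracked during inference instead of falling back to string; again, this
>>>> class is purely illustrative and not part of the actual framework:)
>>>>
>>>> import java.util.LinkedHashSet;
>>>> import java.util.Set;
>>>>
>>>> public class InferredFieldType {
>>>>     // all type alternatives seen so far for one field, e.g.
>>>>     // "nameType1" for the record case, "[nameType2]" for the list case
>>>>     private final Set<String> alternatives = new LinkedHashSet<>();
>>>>
>>>>     public void add(String typeName) {
>>>>         alternatives.add(typeName);
>>>>     }
>>>>
>>>>     // renders a single type name, or union(nameType1, [nameType2])
>>>>     // when more than one alternative was seen
>>>>     public String toAdm() {
>>>>         return alternatives.size() == 1
>>>>                 ? alternatives.iterator().next()
>>>>                 : "union(" + String.join(", ", alternatives) + ")";
>>>>     }
>>>> }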
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>>
>>> *Regards,*
>>> Wail Alkowaileet
>>>
>>
>
>
> -- 
>
> *Regards,*
> Wail Alkowaileet
