asterixdb-dev mailing list archives

From Raman Grover <ramangrove...@gmail.com>
Subject Re: SubscribeFeedStatement
Date Wed, 30 Sep 2015 07:01:21 GMT
Hi,

Please allow me to shed light on this.

*A few preliminaries:*


*1)* A *connect feed* statement (connect feed x to dataset y) is rewritten
as per the following template:

 *for $x in feed-collect(params...) *
*  return $x *

OR

 * for $x in feed-collect(params...) *
*  return f($x )*
  *// where f() represents an AQL/Java function that needs to be applied to
each record prior to persistence. *

The connect feed statement can be regarded as "syntactic sugar". When
compiled, the actual insert statement produces an ingestion pipeline with all
the index insert operators (including those for secondary indexes).
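
To make the template from (1) concrete, here is a minimal sketch that renders it as text. It is not the AsterixDB implementation: the function name rewrite_connect_feed and its arguments are hypothetical, the feed-collect parameters are passed through as opaque strings because their real form is not given here, and wrapping the template in an AQL insert statement is an assumption based on the description above.

```python
from typing import Optional, Sequence


def rewrite_connect_feed(dataset: str,
                         collect_params: Sequence[str],
                         function: Optional[str] = None) -> str:
    """Render the for/return template from (1) as AQL text.

    Hypothetical helper for illustration only. collect_params are
    placeholders; the real feed-collect parameters are not shown here.
    """
    params = ", ".join(collect_params)
    # Apply f($x) when a function is attached, otherwise return $x as is.
    returned = f"{function}($x)" if function else "$x"
    template = f"for $x in feed-collect({params})\nreturn {returned}"
    # Assumption: the template becomes the body of an insert into the target.
    return f"insert into dataset {dataset} (\n{template}\n)"


if __name__ == "__main__":
    print(rewrite_connect_feed("y", ["params..."]))
    print(rewrite_connect_feed("y", ["params..."], function="f"))
```

The two calls correspond to the two variants of the template: one without a function and one where f() is applied to each record.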

*2)*  When building the flow of data from an external source to the target
dataset, one has two options:

   a) use the feed adaptor to retrieve records from the external source.
   b) use an existing active feed to gain access to the records flowing
within the AsterixDB system as part of an ongoing Hyracks job and
further process them to redirect them into separate target indexes.

*3) *The end user is not expected to know how to write an optimal insert
statement or to figure out the best way to produce the records that define a
feed.  The end user is only exposed to a simple connect feed statement.

Read on...

A connect feed statement is analyzed to ascertain whether any existing flow of
records across a parent feed could be used. This requires a lookup into
the in-memory data structure maintained by the FeedLifecycleListener (a
thread in the CC). If any parent feed is present, then the goal is to
**subscribe** to it rather than rebuilding the flow from the external
source via another channel. Depending on which ancestor of the given feed
is active, there could be additional pre-processing required - here I am
referring to all the UDFs associated with the parent feed(s) up to the
active ancestor. Information on these UDFs is obtained from a lookup of the
Metadata.
Once the best way to build the ingestion pipeline for a given connect feed
statement has been determined, the request to receive data from the
ancestor feed, possibly apply a sequence of UDFs, and direct the output to
a target dataset is expressed as a **subscription** request - that is, a
*SubscribeFeedStatement*.  This statement is not exposed to the end user -
it doesn't even have a syntax, but it contains all the required info to
build the AQL (as per the template described in (1) in the list of
preliminaries above). The resulting AQL has the right parameters for the
feed-collect internal function. These parameters capture the parent feed
and the specific locations where the operators are running, so that the
pipeline for the feed being constructed can be correctly located/scheduled
on the cluster and data may subsequently flow in different directions
along multiple pipelines concurrently.
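
The ancestor lookup described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual FeedLifecycleListener or Metadata code: plan_subscription, parent_of, udf_of and active are hypothetical names, the feed hierarchy is modeled as plain dictionaries, and the search is assumed to start at the feed's parent and to include the feed's own UDF in the list to apply.

```python
from typing import Dict, List, Optional, Set, Tuple


def plan_subscription(feed: str,
                      parent_of: Dict[str, Optional[str]],
                      udf_of: Dict[str, Optional[str]],
                      active: Set[str]) -> Tuple[Optional[str], List[str]]:
    """Find the nearest active ancestor and the UDFs still to be applied.

    Returns (source, udfs). source is the ancestor feed to subscribe to,
    or None when no ancestor is active and the flow has to be built from
    the external source through the feed adaptor. udfs is ordered from
    the feed closest to the source down to the given feed.
    """
    pending: List[str] = []
    current: Optional[str] = feed
    while current is not None:
        # Assumption: only proper ancestors are candidates for subscription.
        if current != feed and current in active:
            return current, list(reversed(pending))
        udf = udf_of.get(current)
        if udf is not None:
            pending.append(udf)
        current = parent_of.get(current)
    return None, list(reversed(pending))


if __name__ == "__main__":
    parents = {"C": "B", "B": "A", "A": None}   # A is the primary feed
    udfs = {"C": "f3", "B": "f2", "A": None}
    print(plan_subscription("C", parents, udfs, active={"A"}))
    print(plan_subscription("C", parents, udfs, active={"B"}))
    print(plan_subscription("C", parents, udfs, active=set()))
```

The closer the active ancestor is to the given feed, the fewer UDFs remain to be applied in the new pipeline; with no active ancestor, option (a) from preliminary (2) applies and the whole chain of UDFs is needed.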


The SubscribeFeedStatement is an internal statement that builds the right
AQL counterpart of the simple connect feed statement. It can be
regarded as an intermediate representation of a connect feed statement;
note that neither the connect feed statement nor the SubscribeFeedStatement
is understood by the compiler. It is the AQL translation of the
SubscribeFeedStatement, which is actually an insert statement (refer to the
template from preliminary (1)), that is understood by the compiler to
produce the right DAG, with the right set of index insert operators downstream
and the right locations for the intake operators upstream to receive the
feed records or subscribe to the records flowing in another pipeline.


How the statement is rewritten and translated into AQL is described in
more detail in my thesis:
<https://www.dropbox.com/s/krmwaokt96xmxij/PhD_Dissertation_Raman_Grover.pdf?dl=0>

I hope I have answered the questions of why SubscribeFeedStatement is not
exposed to the end user, why it requires a Metadata lookup, and why the
original connect feed statement is handed to the compiler again (in the form
of an (insert) AQL statement).

In case I did not clarify certain aspects that are also not elaborated
enough in the thesis, please ping me. I shall do my best to respond and
address the concerns as soon as possible.

Regards,
Raman

On Wed, Sep 30, 2015 at 3:34 AM, Till Westmann <tillw@apache.org> wrote:

> Yes, the parser should just care about syntax.
> Semantic checks should be done in the translator or later.
>
> Cheers,
> Till
>
> On 29 Sep 2015, at 14:59, Yingyi Bu wrote:
>
>> In ConnectedFeedStatement, a similar piece of code has been commented out.
>> IMO, the AQL parser should just get an AST from a query, but not access
>> the metadata nor do any real work..
>>
>> Best,
>> Yingyi
>>
>> On Tue, Sep 29, 2015 at 2:46 PM, Ian Maxon <imaxon@uci.edu> wrote:
>>
>>> I always wondered where that plan's input came from in the CC logs. It
>>> gets generated during a connect statement as well.
>>>
>>> On Tue, Sep 29, 2015 at 2:04 PM, Mike Carey <dtabass@gmail.com> wrote:
>>>
>>>> I wasn't aware of that statement...!
>>>>
>>>> On Sep 29, 2015 12:17 PM, "Yingyi Bu" <buyingyi@gmail.com> wrote:
>>>>
>>>>> All right, I will open an issue for that.
>>>>> Thanks!
>>>>>
>>>>> Best,
>>>>> Yingyi
>>>>>
>>>>> On Tue, Sep 29, 2015 at 12:11 PM, abdullah alamoudi
>>>>> <bamousaa@gmail.com> wrote:
>>>>>
>>>>>> I am not aware of any special reason and it definitely looks a bit
>>>>>> too hackish to me.
>>>>>> I would say that it needs to be fixed but I don't think it is a
>>>>>> priority at this point. Anyway, it is a private command that is not
>>>>>> exposed to the end user.
>>>>>>
>>>>>> I would like to know if there is a reason as well.
>>>>>> ~Abdullah.
>>>>>>
>>>>>> Amoudi, Abdullah.
>>>>>>
>>>>>> On Tue, Sep 29, 2015 at 9:53 PM, Yingyi Bu <buyingyi@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Does anyone know why SubscribeFeedStatement in asterix-aql needs to
>>>>>>> access the MetadataManager to form yet-another AQL insert query
>>>>>>> inside it and hand that to the AQLParser again?
>>>>>>>
>>>>>>> It seems a bit hackish to me.  Is there a particular reason that it
>>>>>>> must be done this way?
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Best,
>>>>>>> Yingyi


-- 
Raman
