asterixdb-dev mailing list archives

From Mike Carey <dtab...@gmail.com>
Subject Re: Feeds UDF
Date Wed, 09 Dec 2015 07:50:01 GMT
+1

Again:  The semantics are supposed to be ONE RECORD AT A TIME. It's not 
a join between the incoming stream and the dataset(s) in the query, it's 
just a function being applied to one single individual (insert more 
"one" emphatic words here) record.  I.e., it is NOT a continuous join.  
It's just a function call whose semantics should be identical to calling 
the function on a constant constructed record formed from the next 
record in the input stream.
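
To make the "one record at a time" point concrete, here is a minimal Python sketch. It is not AsterixDB code: the `COUNTIES` list, the `x` field, and all function names are hypothetical stand-ins. It assumes the looked-up dataset does not change while the feed runs. Under that assumption, applying the function per record and applying it in micro-batches give the same output, because each result depends only on the single input record.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical stand-in for a stored dataset such as USCounty.
COUNTIES: List[dict] = [
    {"name": "Orange", "min_x": 0, "max_x": 10},
    {"name": "Riverside", "min_x": 10, "max_x": 20},
]


def add_address(t: dict) -> dict:
    """Enrich ONE record with a lookup; no other feed record is consulted."""
    places = [c["name"] for c in COUNTIES if c["min_x"] <= t["x"] < c["max_x"]]
    return {**t, "place": places}


def apply_per_record(feed: Iterable[dict],
                     udf: Callable[[dict], dict]) -> Iterator[dict]:
    """Call the function on each incoming record, one at a time."""
    for record in feed:
        yield udf(record)


def apply_in_batches(feed: Iterable[dict],
                     udf: Callable[[dict], dict],
                     batch_size: int) -> Iterator[dict]:
    """Buffer records, then call the same function on each buffered record."""
    batch: List[dict] = []
    for record in feed:
        batch.append(record)
        if len(batch) == batch_size:
            yield from (udf(r) for r in batch)
            batch = []
    # Flush whatever is left when the feed ends.
    yield from (udf(r) for r in batch)


if __name__ == "__main__":
    feed = [{"tid": 1, "x": 3}, {"tid": 2, "x": 15}, {"tid": 3, "x": 99}]
    for out in apply_per_record(feed, add_address):
        print(out)
```

Batching here only changes when the function is called, not what it returns for a given record. If the dataset can change mid-feed, the two strategies may diverge, which is a separate question from the semantics described above.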

Cheers,
Mike


On 12/8/15 11:08 PM, Jianfeng Jia wrote:
> I would think this kind of use case, joining stream records with “static” data, is very
> common. Take geo-tagging as an example: for each incoming tweet I’d like to know
> the text-format place name for its geo coordinates. In theory I’d expect the following
> left-outer-join function to work, but currently it does not.
>
> create function addAddress($t) {
> "tid": $t.tweetid,
> "user": $t.user,
> "sender-location":$t.sender-location,
> "send-time": $t.send-time,
> "referred-topics":$t.referred-topics,
> "message-text":$t.message-text,
> "place": for $x in dataset USCounty
>      where spatial-intersect($t.sender-location, $x.geometry) return $x.State-County
> }
>
> Basically, what the user expects is a lookup. We don’t have to stop the feed to perform
> the join. The result will be the same as a static join if we look up the result per record.
> However, for performance reasons, it has to be done in micro-batches. Maybe
> we can pass a hint to “apply” the function once per 50 feed records.
>
> create feed TweetFeed
> apply function addAddress per batch(50);
>
> Or we can allocate some buffer to the feed, and apply functions when the buffer is full.
>
> The streaming world even handles joins of two streams within window boundaries.
> I think we should not restrict applying functions to our feed.
>
> One thought is that we should refine the "create function" API to a finer granularity.
> Like,
>
> --- currently supported ---
> create filter function A(record)
> create transform function B(record)
> create project function C(record)
> --- TODO ---
> create lookup function D( record, dataset)
> create reduce function E([records])
> ...
>
>
>
>> On Dec 8, 2015, at 9:52 PM, abdullah alamoudi <bamousaa@gmail.com> wrote:
>>
>> I think that we probably should restrict feed-applied functions somehow
>> (this needs further thought and discussion), and I know for sure that we currently don't.
>> As for the case you present, I would imagine that it could be allowed
>> theoretically but I think everyone sees why it should be disallowed.
>>
>> One thing to keep in mind is that we introduce a materialize if the dataset
>> was part of an insert pipeline. Now think about how this would work with a
>> continuous feed. One choice would be that the feed will materialize all
>> records to be inserted and once the feed stops, it would start inserting
>> them but I still think we should not allow it.
>>
>> My 2c,
>> Any opposing argument?
>>
>>
>> Amoudi, Abdullah.
>>
>> On Tue, Dec 8, 2015 at 6:28 PM, Ildar Absalyamov <ildar.absalyamov@gmail.com> wrote:
>>> Hi All,
>>>
>>> As a part of feed ingestion we do allow preprocessing incoming data with
>>> AQL UDFs.
>>> I was wondering if we somehow restrict the kind of UDFs that could be
>>> used? Do we allow joins in these UDFs? Especially joins with the same
>>> dataset, which is used for intake. Ex:
>>>
>>> create type TweetType as open {
>>>   id: string,
>>>   username : string,
>>>   location : string,
>>>   text : string,
>>>   timestamp : string
>>> }
>>> create dataset Tweets(TweetType)
>>> primary key id;
>>> create function feed_processor($x) {
>>> for $y in dataset Tweets
>>> // self-join with Tweets dataset on some predicate($x, $y)
>>> return $y
>>> }
>>> create feed TweetFeed
>>> apply function feed_processor;
>>>
>>> The query above fails at runtime, but I was wondering whether it
>>> could work at all, even in theory.
>>>
>>> Best regards,
>>> Ildar
>>>
>>>
>
>
> Best,
>
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine
>

