asterixdb-dev mailing list archives

From Mike Carey <dtab...@gmail.com>
Subject Re: Feeds UDF
Date Wed, 09 Dec 2015 21:24:47 GMT
The function itself takes one record in and produces N records (where 
the normal case is N=1, but you are right, there's nothing to stop it 
from being all of the records in some dataset).  The normal case for a 
join would be something like adding some additional fields to an 
incoming record by matching the record against other datasets - e.g., 
geolocating a Tweet (as Jianfeng mentioned).  Thus, normally one record 
comes in to the function, some processing happens, and a fatter version 
of the record goes out.  Since the record isn't out when the processing 
occurs, it can't see itself during the processing - it doesn't exist in 
stored form yet.  A weird case would be outputting multiple records for 
one incoming record - and given our loose transaction model, if the 
records are going to be stored in the same data set that's being read by 
the function for a join, the join could perhaps see some of the records 
being generated. HOWEVER:  The function doesn't get run on them - they 
are not coming in the front door of the feed - so if an incoming record 
indeed generates N incoming records in its place, it could see the 
original dataset plus N-1 of the other records.  But that's still not 
infinite.  (Nor is it normal for the join to be a self-join or the 
result to have cardinality > 1. :-))
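
A minimal Python sketch of the argument above, not AsterixDB code. The names run_feed and self_join_udf, the list-based dataset, and the username predicate are all assumptions made for illustration. It also simplifies the loose transaction model: the function reads a snapshot taken before any of its own outputs are stored.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Udf = Callable[[Record, List[Record]], List[Record]]


def run_feed(incoming: List[Record], dataset: List[Record], udf: Udf) -> int:
    """Apply udf once per incoming record and return the number of udf calls."""
    calls = 0
    for record in incoming:
        # The function sees what is stored right now; `record` is not stored yet.
        snapshot = list(dataset)
        outputs = udf(record, snapshot)
        calls += 1
        # Outputs go straight to storage. They never re-enter the feed,
        # so the function is never run on them.
        dataset.extend(outputs)
    return calls


def self_join_udf(record: Record, stored: List[Record]) -> List[Record]:
    """Hypothetical self-join: emit the record plus one derived record per match."""
    derived = [
        {**match, "id": record["id"] + "-" + match["id"]}
        for match in stored
        if match["username"] == record["username"]
    ]
    return [record] + derived


tweets = [{"id": "1", "username": "a"}]
feed = [{"id": "2", "username": "a"}, {"id": "3", "username": "a"}]
calls = run_feed(feed, tweets, self_join_udf)
```

With two incoming records the function runs exactly twice, and the dataset ends up with seven records: each call adds a finite number of records, and none of them trigger another call.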

On 12/9/15 9:48 AM, abdullah alamoudi wrote:
> But if the function actually takes a single record and performs a join
> effectively producing a collection of records that feeds into the same
> dataset, wouldn't that create a chance for this infinite loop that would
> eventually fill up the storage and explode the dataset?
>
> One thing to note is that in their current implementation, feed connections
> are translated into insert statements that go through the query compiler,
> meaning that a materialize operator will be introduced.
>
> Cheers,
> Abdullah.
>
> Amoudi, Abdullah.
>
> On Wed, Dec 9, 2015 at 9:40 AM, Mike Carey <dtabass@gmail.com> wrote:
>
>> Hmmm....  I'm not sure where the Halloween problem is in this case - for a
>> given record being ingested, it's not in the dataset yet, and won't get to
>> move further thru the pipeline to the point where it IS in the data set
>> until after the query evaluation is over, the result has been computed, and
>> the new object (the one to be inserted) has been determined.  At least
>> that's how it should work.  There should thus be no way for the ingestion
>> pipeline query to see a record twice in a self-join scenario, because it
>> won't be in play in the dataset yet (it's not part of "self") - right?  (Or
>> is there a subtlety that I'm missing?)
>>
>> Cheers,
>> Mike
>>
>>
>> On 12/9/15 6:59 AM, abdullah alamoudi wrote:
>>
>>> The only problem I see is the Halloween problem in case of a self join,
>>> hence the need for materialization (not sure if it is possible in this case
>>> but definitely possible in general). Other than that, I don't think there
>>> is any problem.
>>>
>>> Cheers,
>>> Abdullah
>>> On Dec 8, 2015 11:51 PM, "Mike Carey" <dtabass@gmail.com> wrote:
>>>
>>> (I am still completely not seeing a problem here.)
>>>> On 12/8/15 10:20 PM, abdullah alamoudi wrote:
>>>>
>>>> The plan is to mostly use Upsert in the future since we can do some
>>>>> optimizations with it that we can't do with an insert.
>>>>> We should also support deletes as well and probably allow a mix of the
>>>>> three operations within the same feed. This is a work in progress right
>>>>> now
>>>>> but before I go far, I am stabilizing some other parts of the feeds.
>>>>>
>>>>> Cheers,
>>>>> Abdullah.
>>>>>
>>>>>
>>>>> Amoudi, Abdullah.
>>>>>
>>>>> On Tue, Dec 8, 2015 at 10:11 PM, Ildar Absalyamov <
>>>>> ildar.absalyamov@gmail.com> wrote:
>>>>>
>>>>> Abdullah,
>>>>>
>>>>>> OK, now I see what problems it will cause.
>>>>>> Kinda related question: could the feed implement “upsert” semantics,
>>>>>> that
>>>>>> you’ve been working on, instead of “insert” semantics?
>>>>>>
>>>>>> On Dec 8, 2015, at 21:52, abdullah alamoudi <bamousaa@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think that we probably should restrict feed applied functions somehow
>>>>>>> (needs further thoughts and discussions) and I know for sure that we
>>>>>>> don't.
>>>>>>> As for the case you present, I would imagine that it could be allowed
>>>>>>> theoretically but I think everyone sees why it should be disallowed.
>>>>>>>
>>>>>>> One thing to keep in mind is that we introduce a materialize if the
>>>>>>> dataset was part of an insert pipeline. Now think about how this would
>>>>>>> work with a continuous feed. One choice would be that the feed will
>>>>>>> materialize all records to be inserted and once the feed stops, it would
>>>>>>> start inserting them but I still think we should not allow it.
>>>>>>>
>>>>>>> My 2c,
>>>>>>> Any opposing argument?
>>>>>>>
>>>>>>>
>>>>>>> Amoudi, Abdullah.
>>>>>>>
>>>>>>> On Tue, Dec 8, 2015 at 6:28 PM, Ildar Absalyamov <
>>>>>>> ildar.absalyamov@gmail.com> wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> As a part of feed ingestion we do allow preprocessing incoming data
>>>>>>>> with AQL UDFs.
>>>>>>>> I was wondering if we somehow restrict the kind of UDFs that could be
>>>>>>>> used? Do we allow joins in these UDFs? Especially joins with the same
>>>>>>>> dataset, which is used for intake. Ex:
>>>>>>>>
>>>>>>>> create type TweetType as open {
>>>>>>>>     id: string,
>>>>>>>>     username : string,
>>>>>>>>     location : string,
>>>>>>>>     text : string,
>>>>>>>>     timestamp : string
>>>>>>>> }
>>>>>>>> create dataset Tweets(TweetType)
>>>>>>>> primary key id;
>>>>>>>> create function feed_processor($x) {
>>>>>>>> for $y in dataset Tweets
>>>>>>>> // self-join with Tweets dataset on some predicate($x, $y)
>>>>>>>> return $y
>>>>>>>> }
>>>>>>>> create feed TweetFeed
>>>>>>>> apply function feed_processor;
>>>>>>>>
>>>>>>>> The query above fails at runtime, but I was wondering if that
>>>>>>>> theoretically could work at all.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Ildar

