asterixdb-dev mailing list archives

From Jianfeng Jia <jianfeng....@gmail.com>
Subject Re: Searching for duplicates during feed ingestion.
Date Mon, 08 May 2017 20:37:37 GMT
Got the point now…
I would imagine that if the record had a version number, that could potentially
solve some problems here. However, it would be a totally different story then.
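
The point made in the quoted reply below (an upsert still needs the old record in order to maintain secondary indexes) can be illustrated with a toy model. This is only a minimal sketch in Python, not AsterixDB code: `ToyDataset`, `blind_put`, and the `lang` field are hypothetical names made up for the illustration, and plain dictionaries stand in for the real LSM indexes.

```python
class ToyDataset:
    """Toy model: one primary index plus one secondary index on a single field."""

    def __init__(self, secondary_field):
        self.primary = {}              # primary key -> record
        self.secondary = {}            # field value -> set of primary keys
        self.secondary_field = secondary_field
        self.primary_searches = 0      # number of primary-index lookups performed

    def _search(self, key):
        self.primary_searches += 1
        return self.primary.get(key)

    def _index(self, key, record):
        self.primary[key] = record
        self.secondary.setdefault(record[self.secondary_field], set()).add(key)

    def insert(self, key, record):
        # Insert searches first so that it can reject duplicate keys.
        if self._search(key) is not None:
            raise KeyError(f"duplicate key: {key}")
        self._index(key, record)

    def upsert(self, key, record):
        # Upsert also searches: the old record tells us which
        # secondary-index entry has become stale and must be removed.
        old = self._search(key)
        if old is not None:
            old_value = old[self.secondary_field]
            self.secondary[old_value].discard(key)
            if not self.secondary[old_value]:
                del self.secondary[old_value]
        self._index(key, record)

    def blind_put(self, key, record):
        # No search at all: cheap, but stale secondary entries are left behind.
        self._index(key, record)


# Upsert keeps the secondary index consistent, at the cost of one search.
ds = ToyDataset("lang")
ds.insert(1, {"lang": "en"})
ds.upsert(1, {"lang": "fr"})

# A blind write skips the search, but key 1 is now listed under both values.
blind = ToyDataset("lang")
blind.blind_put(1, {"lang": "en"})
blind.blind_put(1, {"lang": "fr"})

print(ds.secondary, ds.primary_searches)
print(blind.secondary, blind.primary_searches)
```

In this toy model the blind write is only safe when there is no secondary index to keep consistent, which matches the case the pending change mentioned further down the thread targets.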

> On May 8, 2017, at 12:39 PM, Mike Carey <dtabass@gmail.com> wrote:
> 
> Note that upserts don't avoid searches... (Still need to get the old record
> to update secondary indexes from.)
> 
> 
> On 5/8/17 12:10 PM, Jianfeng Jia wrote:
>> Aha, never knew that before. We will definitely try upsert feed next time!
>> Thanks for pointing it out!
>> 
>>> On May 8, 2017, at 12:07 PM, Ildar Absalyamov <ildar.absalyamov@gmail.com> wrote:
>>> 
>>> I believe we already support upsert feeds ;)
>>> https://github.com/apache/asterixdb/blob/master/asterixdb/asterix-app/src/test/resources/runtimets/queries/feeds/upsert-feed/upsert-feed.1.ddl.aql
>>>> On May 8, 2017, at 12:04, Jianfeng Jia <jianfeng.jia@gmail.com> wrote:
>>>> 
>>>> I also observe this slowdown every time we re-ingest the Twitter data.
>>>> One difference is that duplicate keys can occur in our case, and we know
>>>> they are indeed duplicate records. To skip the search, we would want
>>>> “upsert” logic (just replace the old one :-) ) instead of an insert.
>>>> 
>>>> Then maybe we could add an option to the feed configuration, like
>>>> 
>>>> create feed MessageFeed using localfs(
>>>> ("format"="adm"),
>>>> ("type-name"="typeX"),
>>>> ("upsert"="true")
>>>> );
>>>> 
>>>> to indicate that this feed uses upsert logic instead of insert.
>>>> 
>>>> One thing we need to confirm is whether “upsert” is actually implemented
>>>> in a no-search fashion.
>>>> Based on the way we search the components, only the most recent version
>>>> of a record will be returned, so a blind insert should be logically OK.
>>>> Correct me if I missed some other cases (highly likely :-)).
>>>> 
>>>> 
>>>>> On May 8, 2017, at 11:05 AM, Mike Carey <dtabass@gmail.com> wrote:
>>>>> 
>>>>> +0.99 from me.
>>>>> 
>>>>> 
>>>>> On 5/8/17 9:50 AM, Taewoo Kim wrote:
>>>>>> +1 for auto-generated ID case
>>>>>> 
>>>>>> Best,
>>>>>> Taewoo
>>>>>> 
>>>>>> On Mon, May 8, 2017 at 8:57 AM, Yingyi Bu <buyingyi@gmail.com> wrote:
>>>>>> 
>>>>>>> Abdullah has a pending change that disables searches if there are no
>>>>>>> secondary indexes [1].
>>>>>>> Auto-generated IDs could be another case for which we can disable
>>>>>>> searches as well.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Yingyi
>>>>>>> 
>>>>>>> [1] https://asterix-gerrit.ics.uci.edu/#/c/1711/
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, May 8, 2017 at 4:30 AM, Wail Alkowaileet <wael.y.k@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Devs,
>>>>>>>> 
>>>>>>>> I'm noticing that ingestion gets slower over time. I know that is
>>>>>>>> expected behavior in LSM indexes, but the drop in ingestion rate
>>>>>>>> becomes noticeable after roughly 10 components (around 13 GB).
>>>>>>>> I'm not sure whether that is expected.
>>>>>>>> 
>>>>>>>> I tried multiple setups (increasing the memory component size and
>>>>>>>> max-mergable-component-size). All of them delayed the problem but
>>>>>>>> did not solve it. The only setting I have never changed is the
>>>>>>>> bloom filter false-positive rate (1%), which I want to investigate
>>>>>>>> next.
>>>>>>>> 
>>>>>>>> So..
>>>>>>>> What I want to ask is: when the primary key is auto-generated, why
>>>>>>>> does AsterixDB look for duplicates? It seems a wasteful operation
>>>>>>>> to me. Also, can we give the user the ability to tell the index that
>>>>>>>> all keys are unique? I know I should not trust the user, but in
>>>>>>>> certain cases the user is probably certain that the key is unique.
>>>>>>>> Or a more elegant solution may shine in the end :-)
>>>>>>>> 
>>>>>>>> --
>>>>>>>> 
>>>>>>>> *Regards,*
>>>>>>>> Wail Alkowaileet
>>>>>>>> 
>>> Best regards,
>>> Ildar
>>> 
> 

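As a rough back-of-the-envelope aid for the original question about 10 components and a 1% bloom filter false-positive rate, the cost of the duplicate check for a brand-new key can be estimated as follows. This is a sketch under stated assumptions, not a measurement of AsterixDB: it assumes every disk component has its own bloom filter, that the check consults every component's filter, that the filters behave independently, and that the key is not present anywhere. The function names are made up for the illustration.

```python
def expected_false_positive_probes(num_components: int, fpr: float) -> float:
    """Expected number of components whose filter wrongly says 'maybe present'."""
    return num_components * fpr


def prob_at_least_one_false_positive(num_components: int, fpr: float) -> float:
    """Probability that at least one filter gives a false positive for a new key."""
    return 1.0 - (1.0 - fpr) ** num_components


if __name__ == "__main__":
    fpr = 0.01  # the 1% rate mentioned in the thread
    for n in (1, 5, 10, 20):
        expected = expected_false_positive_probes(n, fpr)
        at_least_one = prob_at_least_one_false_positive(n, fpr)
        print(f"{n:2d} components: expected wasted probes = {expected:.2f}, "
              f"P(at least one) = {at_least_one:.4f}")
```

Under these assumptions, 10 components at 1% mean that roughly 9.6% of inserts of new keys pay for at least one unnecessary component search, and that share grows with the number of components until a merge reduces it.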
