asterixdb-notifications mailing list archives

From "Jianfeng Jia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ASTERIXDB-1699) Inverted Index fail to match the keyword
Date Mon, 24 Oct 2016 03:54:58 GMT

    [ https://issues.apache.org/jira/browse/ASTERIXDB-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600855#comment-15600855 ]

Jianfeng Jia commented on ASTERIXDB-1699:
-----------------------------------------

The filter should not be the problem. I ingested the data using the no-filter DDL, and the
issue showed up again. It seems to be related to the data ingestion order.

I first ingested the data from Jun 2016 to Sep 2016, then the data from Nov. 2015 to
May 2015. After that, I found that the December data was malformed, so none of it had been
ingested. After I fixed the data source, I ingested the Dec. 2015 data again. Then the same
problem occurred in the Dec 2015 range again. This is only a hint, because I tried a similar
ingestion order with a small-scale dataset and could not reproduce the problem.

> Inverted Index fail to match the keyword
> ----------------------------------------
>
>                 Key: ASTERIXDB-1699
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1699
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: Storage
>         Environment: master : 4819ea44723b87a68406d248782861cf6e5d3305
>            Reporter: Jianfeng Jia
>            Assignee: Ian Maxon
>
> It is not clear how to reproduce this on a smaller dataset. Here is the symptom:
> If I run the following query
> {code}
> for $t in dataset twitter.ds_tweet
> where $t.'create_at' >= datetime('2016-10-19T00:00:47.473Z') and $t.'create_at' < datetime('2016-10-19T00:01:47.473Z')
> and  /* +skip-index */ similarity-jaccard(word-tokens($t.'text'), word-tokens('sleep')) > 0.0
> return $t.text
> {code}
> It returns some results:
> {code}
> "No point in going to sleep now lol"
> "Can't sleep"
> "TL Sleep"
> "i can't sleep man"
> "Blazed and I still can't sleep fackkkk.."
> "When you're proud of yourself for going to bed in time to get 6 hours of sleep #CollegeLyfeAmIRightIAmIt'sSoCrazyLol"
> "I would be sleep rn but have to lurk bc I'm no sucka & bc the fan isn't working"
> "Since I can't sleep https://t.co/ALZE4psIqP"
> "Wish I Could Sleep"
> "Of course when I go to lay down finally, I am not tired. To sleep or not to sleep?? That's the real question."
> {code}
> If I use the index instead
> {code}
> for $t in dataset twitter.ds_tweet
> where $t.'create_at' >= datetime('2016-10-19T00:00:47.473Z') and $t.'create_at' < datetime('2016-10-19T00:01:47.473Z')
> and  similarity-jaccard(word-tokens($t.'text'), word-tokens('sleep')) > 0.0
> return $t.text
> {code}
> It returns an empty result.
> The debug port is on 8001 on each cloudberry nuc nc. 
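
The expectation behind the two quoted queries is that a full scan and an inverted-index
lookup return the same rows, because a Jaccard similarity greater than 0.0 only requires
one shared token. The following is a minimal Python sketch of that invariant, not AsterixDB
code: the tokenizer is an assumption (lowercase, split on non-alphanumeric characters) and
may differ from the real word-tokens function, and all function names are hypothetical.

```python
import re


def word_tokens(text):
    # Assumed tokenizer: lowercase, split on anything that is not a-z or 0-9.
    return {tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok}


def jaccard(a, b):
    # |A intersect B| / |A union B|; defined as 0.0 when both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def scan_match(texts, keyword):
    # Equivalent of the skip-index query: test every record.
    query = word_tokens(keyword)
    return [t for t in texts if jaccard(word_tokens(t), query) > 0.0]


def build_inverted_index(texts):
    # Map each token to the set of record positions that contain it.
    index = {}
    for pos, text in enumerate(texts):
        for tok in word_tokens(text):
            index.setdefault(tok, set()).add(pos)
    return index


def index_match(texts, index, keyword):
    # Equivalent of the indexed query: union the posting sets of the query tokens.
    hits = set()
    for tok in word_tokens(keyword):
        hits |= index.get(tok, set())
    return [texts[i] for i in sorted(hits)]


texts = ["Can't sleep", "Wish I Could Sleep", "Good morning"]
index = build_inverted_index(texts)
# The invariant the bug report says is violated: both paths must agree.
assert scan_match(texts, "sleep") == index_match(texts, index, "sleep")
```

In this toy model the two paths can only disagree if the index is missing postings for
records that were ingested, which is consistent with the suspicion above that the ingestion
order matters.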



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
