asterixdb-notifications mailing list archives

From "Jianfeng Jia (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ASTERIXDB-1699) Inverted Index fail to match the keyword
Date Wed, 19 Oct 2016 18:17:58 GMT
Jianfeng Jia created ASTERIXDB-1699:
---------------------------------------

             Summary: Inverted Index fail to match the keyword
                 Key: ASTERIXDB-1699
                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1699
             Project: Apache AsterixDB
          Issue Type: Bug
          Components: Storage
         Environment: master : 4819ea44723b87a68406d248782861cf6e5d3305
            Reporter: Jianfeng Jia
            Assignee: Ian Maxon


It is not clear how to reproduce this on a smaller dataset. Here is the symptom:

If I run the following query, which skips the index:
{code}
for $t in dataset twitter.ds_tweet

where $t.'create_at' >= datetime('2016-10-19T00:00:47.473Z') and $t.'create_at' < datetime('2016-10-19T00:01:47.473Z')

and  /* +skip-index */ similarity-jaccard(word-tokens($t.'text'), word-tokens('sleep')) > 0.0
return $t.text

{code}

It returns some results:
{code}
"No point in going to sleep now lol"
"Can't sleep"
"TL Sleep"
"i can't sleep man"
"Blazed and I still can't sleep fackkkk.."
"When you're proud of yourself for going to bed in time to get 6 hours of sleep #CollegeLyfeAmIRightIAmIt'sSoCrazyLol"
"I would be sleep rn but have to lurk bc I'm no sucka & bc the fan isn't working"
"Since I can't sleep https://t.co/ALZE4psIqP"
"Wish I Could Sleep"
"Of course when I go to lay down finally, I am not tired. To sleep or not to sleep?? That's the real question."
{code}

If I run the same query using the index (without the skip-index hint):

{code}
for $t in dataset twitter.ds_tweet

where $t.'create_at' >= datetime('2016-10-19T00:00:47.473Z') and $t.'create_at' < datetime('2016-10-19T00:01:47.473Z')

and  similarity-jaccard(word-tokens($t.'text'), word-tokens('sleep')) > 0.0
return $t.text

{code}

It returns an empty result.
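
For reference, the predicate in both queries should be logically the same: a Jaccard similarity greater than 0.0 against the single token 'sleep' holds exactly when the tweet's token set contains that token. Below is a minimal Python sketch of that expectation. It is not AsterixDB code; the tokenizer (lowercase, split on non-alphanumeric characters) is a simplifying assumption and may differ from the real word-tokens, and the "Good morning" tweet is a made-up non-matching example.

```python
import re


def word_tokens(text):
    # Assumed tokenizer: lowercase, then split on non-alphanumeric characters.
    # This only approximates AsterixDB's word-tokens.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity_jaccard(a, b):
    # Jaccard similarity of two token sets: |a & b| / |a | b|.
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Two tweets from the results above, plus one hypothetical non-match.
tweets = ["Can't sleep", "Wish I Could Sleep", "Good morning"]
query = word_tokens("sleep")

# Any tweet sharing at least one token with the query has similarity > 0.0.
matches = [t for t in tweets if similarity_jaccard(word_tokens(t), query) > 0.0]
print(matches)
```

Under these assumptions the two tweets containing "sleep" pass the filter, so the indexed query would be expected to return the same rows as the skip-index query rather than nothing.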

The debug port is 8001 on each cloudberry NUC NC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
