asterixdb-notifications mailing list archives

From "Taewoo Kim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ASTERIXDB-1704) Fuzzy-join query is slow
Date Sat, 22 Oct 2016 22:26:58 GMT

    [ https://issues.apache.org/jira/browse/ASTERIXDB-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15598598#comment-15598598 ]

Taewoo Kim commented on ASTERIXDB-1704:
---------------------------------------

For the same query with a different threshold (0.9) on the same dataset (200K records), it took
8,520 sec (142 min).
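
For context on why the threshold matters: in the standard prefix-filtering scheme for Jaccard joins, only a prefix of each record's globally ordered token set is used to generate candidate pairs, and a higher threshold gives a shorter prefix. The sketch below assumes the textbook formula `n - ceil(t * n) + 1`; the function name is illustrative, and this is not necessarily what AsterixDB computes internally.

```python
import math


def prefix_len_jaccard(set_size: int, threshold: float) -> int:
    """Prefix length under standard Jaccard prefix filtering (assumed formula).

    Two sets whose Jaccard similarity is >= threshold must share at least one
    token within the first `set_size - ceil(threshold * set_size) + 1` tokens
    of each set, given a common global token order.
    """
    if set_size <= 0:
        return 0
    # Subtract a tiny epsilon so float noise (e.g. 0.9 * 10) cannot push ceil up by one.
    required_overlap = math.ceil(threshold * set_size - 1e-9)
    return set_size - required_overlap + 1


if __name__ == "__main__":
    # Compare prefix lengths for a 40-token review at the two thresholds mentioned.
    for t in (0.2, 0.9):
        print(t, prefix_len_jaccard(40, t))
```

With a 40-token record, this gives a prefix of 33 tokens at threshold 0.2 but only 5 tokens at 0.9, so a low threshold leaves the prefix filter with far less pruning power.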


> Fuzzy-join query is slow
> ------------------------
>
>                 Key: ASTERIXDB-1704
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1704
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>
> I have an issue with the prefix-based fuzzy join (non-index-based fuzzy join) on a small dataset. The following query appears to run forever even on a dataset of 200K records spread across 9 nodes, so each node holds only about 22,000 records. The records themselves are also not that big.
> {code}
> count(
> for $o in dataset AmazonReview
> for $i in dataset AmazonReview
> where similarity-jaccard(word-tokens($o.reviewText), word-tokens($i.reviewText)) >= 0.2 and $o.id < $i.id
> return {"oid":$o.reviewerID, "iid":$i.reviewerID}
> );
> {code}
> An example record is as follows.  
> {code}
> {
>   "reviewerID": "A2SUAM1J3GNN3B",
>   "asin": "0000013714",
>   "reviewerName": "J. McDonald",
>   "helpful": [2, 3],
>   "reviewText": "I bought this for my husband who plays the piano.  He is having a wonderful time playing these old hymns.  The music  is at times hard to read because we think the book was published for singing from more than playing from.  Great purchase though!",
>   "overall": 5.0,
>   "summary": "Heavenly Highway Hymns",
>   "unixReviewTime": 1252800000,
>   "reviewTime": "09 13, 2009"
> }
> {code}
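
To make the cost of the quoted query concrete, here is a minimal Python sketch of what the join predicate computes and how many record pairs the `$o.id < $i.id` self-join has to consider when nothing is pruned. The tokenizer is a simplified stand-in for `word-tokens` (lowercasing and splitting on non-alphanumerics are assumptions), tokens are treated as sets, and all function names are illustrative rather than AsterixDB APIs.

```python
import re


def word_tokens(text: str) -> list[str]:
    """Simplified stand-in for word-tokens: lowercase alphanumeric runs (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def similarity_jaccard(a: list[str], b: list[str]) -> float:
    """Jaccard similarity of two token lists, treated as sets (assumption)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0  # convention chosen for this sketch only
    return len(sa & sb) / len(sa | sb)


def self_join_pairs(n: int) -> int:
    """Number of unordered pairs (o, i) with o.id < i.id among n records."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    x = word_tokens("Great purchase though!")
    y = word_tokens("great purchase")
    print(similarity_jaccard(x, y))
    print(self_join_pairs(200_000))
```

For 200,000 records the self-join covers 19,999,900,000 candidate pairs if every pair is compared, which is why candidate pruning matters even on a dataset this small.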



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
