asterixdb-notifications mailing list archives

From "Chen Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ASTERIXDB-1704) Fuzzy-join query is slow
Date Mon, 24 Oct 2016 03:25:58 GMT

    [ https://issues.apache.org/jira/browse/ASTERIXDB-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600806#comment-15600806 ]

Chen Li commented on ASTERIXDB-1704:
------------------------------------

[~wangsaeu] If the "hash-group by patch" is the reason for the slowdown, does that mean the previous
version was using a lot of memory?  If so, I think we may be able to reproduce the previous performance
by increasing the allocated memory.

> Fuzzy-join query is slow
> ------------------------
>
>                 Key: ASTERIXDB-1704
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1704
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>
> I have an issue with the prefix-based fuzzy join (non-index-based fuzzy join) on
> a small dataset. The following query runs forever even for a dataset with 200K records on
> 9 nodes, so each node holds only about 22,000 records. Also, the records are not large.
> {code}
> count(
> for $o in dataset AmazonReview
> for $i in dataset AmazonReview
> where similarity-jaccard(word-tokens($o.reviewText), word-tokens($i.reviewText)) >= 0.2 and $o.id < $i.id
> return {"oid":$o.reviewerID, "iid":$i.reviewerID}
> );
> {code}
> An example record is as follows.  
> {code}
> {
>   "reviewerID": "A2SUAM1J3GNN3B",
>   "asin": "0000013714",
>   "reviewerName": "J. McDonald",
>   "helpful": [2, 3],
>   "reviewText": "I bought this for my husband who plays the piano.  He is having a wonderful time playing these old hymns.  The music  is at times hard to read because we think the book was published for singing from more than playing from.  Great purchase though!",
>   "overall": 5.0,
>   "summary": "Heavenly Highway Hymns",
>   "unixReviewTime": 1252800000,
>   "reviewTime": "09 13, 2009"
> }
> {code}
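
For readers unfamiliar with the predicate in the quoted query, the following is a minimal Python sketch of what it computes: the number of record pairs with `o.id < i.id` whose word-token sets have a Jaccard similarity of at least 0.2. It is an illustration only and not AsterixDB's execution plan. The tokenizer (lowercase, split on non-alphanumeric characters), the helper names, and the three sample records with an `id` field are assumptions made for the example; AsterixDB's `word-tokens` may tokenize differently.

```python
import re
from itertools import combinations


def word_tokens(text):
    """Assumed tokenizer: lowercase, then split on non-alphanumeric runs."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Set-based Jaccard similarity; treated as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def count_similar_pairs(records, threshold=0.2):
    """Count pairs (o, i) with o.id < i.id and Jaccard >= threshold."""
    tokens = {r["id"]: word_tokens(r["reviewText"]) for r in records}
    return sum(
        1
        for o, i in combinations(sorted(tokens), 2)  # yields each o < i once
        if jaccard(tokens[o], tokens[i]) >= threshold
    )


# Hypothetical sample records; the `id` values are invented for illustration.
records = [
    {"id": 1, "reviewText": "Great purchase though!"},
    {"id": 2, "reviewText": "A great purchase for my husband."},
    {"id": 3, "reviewText": "The music is hard to read."},
]
print(count_similar_pairs(records))
```

This naive version compares every pair, which for 200K records is roughly 2 x 10^10 comparisons. A prefix-based fuzzy join exists to avoid exactly that by pruning pairs that cannot reach the threshold before computing the full similarity.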



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
