asterixdb-notifications mailing list archives

From "Taewoo Kim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ASTERIXDB-1556) Prefix-based multi-way Fuzzy-join generates an exception.
Date Thu, 28 Jul 2016 21:02:20 GMT

    [ https://issues.apache.org/jira/browse/ASTERIXDB-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15398201#comment-15398201 ]

Taewoo Kim commented on ASTERIXDB-1556:
---------------------------------------

From Wenhai:

Yep, to be honest, I think so too. As far as I know, the actual cost depends on the threshold
as well as on the cardinality difference between the two inputs of the binary join. Since the
underlying binary join feeds its parent, and a fuzzy join always propagates cardinality from
its inputs to its output, I don't think it's a good idea, as far as the join cost itself is
concerned, to set up a prefix join first and enforce the inverted index after that. I think
it's better to make the plan runnable and provide a hint that leaves the choice to the user.
Afterwards, maybe we can automatically generate the optimal plan from statistics on the
cardinalities of both branches. I guess the right optimization will most likely be to set up
the inverted-index join first (if there is a very highly selective operator below the first
fuzzy join), combined with a series of prefix joins.

By the way, in our initial results for the 1-million CSX join 1-million DBLP case, the
inverted-index join is superior to the prefix-based join ONLY WHEN the threshold is larger
than about 0.8 and the difference between the two inputs is higher than 10000 (which means
the selectivity on the DBLP side is below 0.01%). Otherwise, the prefix-based join is almost
always superior to the inverted-index join. Similar results hold for the ACM join CITE
datasets on both their authors and titles.
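
To make the hint idea concrete, the observation above could be sketched as a small rule of
thumb. This is only an illustration, not AsterixDB code: the function name, the parameter
names, and the returned labels are hypothetical, and reading "difference higher than 10000"
as a cardinality ratio is an assumption based on the 0.01% figure.

```python
def choose_fuzzy_join(threshold, left_card, right_card,
                      min_threshold=0.8, min_ratio=10_000):
    """Pick a fuzzy-join strategy from the similarity threshold and input sizes.

    Hypothetical helper: the 0.8 and 10000 defaults are the rough values from the
    initial CSX/DBLP results, not tuned constants.
    """
    small, large = sorted((left_card, right_card))
    # Assumption: "difference" is read as the ratio of the larger input to the
    # smaller one; guard against an empty input to avoid division by zero.
    ratio = large / max(small, 1)
    if threshold > min_threshold and ratio > min_ratio:
        return "inverted-index"
    return "prefix"


# A highly selective branch (50 records) against 1 million, at threshold 0.9.
print(choose_fuzzy_join(0.9, 1_000_000, 50))
# Two inputs of equal size, at the same threshold.
print(choose_fuzzy_join(0.9, 1_000_000, 1_000_000))
```

A real optimizer would take these numbers from collected statistics rather than fixed
constants; until then, the same decision is what a user-supplied hint would encode.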

In your query template, there is a partial comparison as well as a fuzzy join on both join
operators. This is quite an advanced topic. In a traditional DBMS, I think the partial
comparison (<, >, >=, etc.) will always be applied as a select after the actual fuzzy join
(to avoid the cross product). Given that, I think the above points still hold if we simply
treat the remaining parts (everything besides the fuzzy join) as simple selects. Maybe we can
try a more advanced sort-based theta join in the future, but for now I believe the hint may
be the only option for your question, once we make the prefix join runnable.

To close, as for this work itself, I think the join cost is more important than the complexity
of the query plan. Agree?

> Prefix-based multi-way Fuzzy-join generates an exception.
> ---------------------------------------------------------
>
>                 Key: ASTERIXDB-1556
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1556
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>
> When we enable the prefix-based fuzzy join and apply a multi-way fuzzy join (> 2),
> the system generates an out-of-memory exception.
> Since a fuzzy join is created using 30-40 lines of AQL code, and this AQL is translated
> into a massive number of operators (more than 200 operators in the plan for a 3-way fuzzy
> join), it could generate an out-of-memory exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
