asterixdb-notifications mailing list archives

From "Wenhai Li (Code Review)" <do-not-re...@asterixdb.incubator.apache.org>
Subject Change in asterixdb[master]: Applied the multiway fuzzyjoin based on the prefix-based joi...
Date Mon, 07 Nov 2016 16:23:42 GMT
Wenhai Li has posted comments on this change.

Change subject: Applied the multiway fuzzyjoin based on the prefix-based join and the selectFuzzyJoin
testCases.
......................................................................


Patch Set 21:

(10 comments)

@Taewoo,

Sorry, I did not know you can't see the comments until they are published. :)

OK, published. Maybe we can discuss the details with a running example. For the specific
details below, these pessimistic checks are meant to always prevent exceptions during dynamic
variable generation.

https://asterix-gerrit.ics.uci.edu/#/c/1076/21/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java
File asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java:

Line 197:         // To handle multiple fuzzyjoin conditions on the same table pair, this
rule differentiate the PKs
> What do we mean by [the same table pair] here?
Query example:
use dataverse fuzzytest;
for $d in dataset DBLP
for $t in dataset CSX
where word-tokens($d.title) ~= word-tokens($t.title) and word-tokens($d.authors) ~= word-tokens($t.authors)
return {"did": $d.tid, "tid": $t.tid}

General explanation:

The first round has the following functional dependencies:
1. $d.title -> $d.tid
2. $t.title -> $t.tid
which means $d.title is derived from table $d and $t.title is derived from $t, respectively.

After iteration, in the second round, we have the following functional dependencies:
1. $d.authors -> $d.tid
2. $t.authors -> $t.tid
Both right-hand sides are already recorded in previousPK.

Result:
In this context, we simply have two fuzzy join conditions on the same table pair, and the second
fuzzy join SHOULD be interpreted as a fuzzy select over the result of the first fuzzy join.

Handling strategy:
Skip the second fuzzy join rather than interpreting it as another fuzzy join, to avoid a
wrong substitution based on the fixed template (we have already substituted the two table
branches in the first fuzzy join).


Line 207:         Set<LogicalVariable> currentPK = new HashSet<>();
> I'm confused about currentPK and previousPK concept. Can you explain more?
In general, each round of potential substitution scans all of its branch variables to find
out where they come from.

currentPK is the set of primary keys of the current ~='s branches.

previousPK is the set of primary keys of the already scanned/substituted ~='s branches.

If they are equal, we treat the current ~= as a duplicate fuzzy join condition on the same
table pair.

For example:

use dataverse fuzzytest;
for $d in dataset DBLP
for $t in dataset CSX
where word-tokens($d.title) ~= word-tokens($t.title) and word-tokens($d.authors) ~= word-tokens($t.authors)
return {"did": $d.tid, "tid": $t.tid}

$d.title and $t.title as well as their PKs are the previous derivations, and $d.authors and
$t.authors are the current derivations.
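
To make the currentPK/previousPK comparison concrete, here is a minimal sketch. It is not the
code in FuzzyJoinRule.java: the class PkDedupSketch, the method classify, and the use of plain
strings in place of LogicalVariable are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; strings stand in for LogicalVariable.
final class PkDedupSketch {
    // PK sets of the ~= conditions already rewritten as fuzzy joins.
    private final List<Set<String>> previousPKs = new ArrayList<>();

    /** Returns "join" for the first ~= on a table pair and "select" for a repeat. */
    String classify(Set<String> currentPK) {
        // Set equality ignores order, so {d, t} and {t, d} are the same table pair.
        if (previousPKs.contains(currentPK)) {
            return "select";
        }
        previousPKs.add(new HashSet<>(currentPK));
        return "join";
    }

    public static void main(String[] args) {
        PkDedupSketch rule = new PkDedupSketch();
        // title ~= title on (DBLP, CSX): first time this pair is seen.
        System.out.println(rule.classify(new HashSet<>(Arrays.asList("$d.tid", "$t.tid")))); // join
        // authors ~= authors on the same pair: same PK set as before.
        System.out.println(rule.classify(new HashSet<>(Arrays.asList("$t.tid", "$d.tid")))); // select
        // A ~= between CSX and ACM has a different PK set.
        System.out.println(rule.classify(new HashSet<>(Arrays.asList("$t.tid", "$r.tid")))); // join
    }
}
```

The sketch only covers the bookkeeping; how the PKs are actually derived from the branches is
left out.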


Line 210:         // If PKs derived from the both branches are SAME as a previous fuzzyjoin,
we treat this ~= as a select.
> Here, "previous fuzzy join" means? Can you present an example?
See the comment on line 207 about word-tokens($d.title) ~= word-tokens($t.title).


Line 251:         ConstantExpression constExpr = (ConstantExpression) inputExp2;
> The reason of this change - not using FuzzyUtils.getSimThreshold()?
At least one case is involved:

similarity-jaccard() <> threshold

To get the threshold in this case, I think FuzzyUtils.getSimThreshold is not enough.


Line 268:                 break;
> Have we fixed the bug that mentioned in the previous TODO? Can we explain m
If we only permute the three for clauses in the mentioned test case, the results in this code
branch are consistent. Also, if we change the join conditions in this query, I think that is
not a bug but a semantic difference. I guess the old issue, the one in the comment shown in red
on the left, came from the flattening process or from an ordering issue. In any case, it has
disappeared in this branch on the current master.


Line 317:         translator.addVariableToMetaScope(new VarIdentifier("$$LEFT_0"), leftInputVar);
> What's the difference between # and $$? I think I saw this in the Vernica's
This, too, comes from the new master of about one year ago. In short, "#" is for operators
and "$$" is for vars. In addition, the translator is triggered several times, once per round
for each legal ~= (one whose currentPK is not the same as any of the previous PK sets). We
need to increment the vars counter in each round, after we generate the new vars for the
substituted branches' vars. Together with line 356 below, we can thus generate identical vars
for all rounds of var generation requests.
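
A rough sketch of the counter bookkeeping, under the assumption that the aim is to give each
translation round its own fresh range of variable ids. The class VarCounterSketch and the method
allocateRound are hypothetical; only the "advance the counter by the number of generated vars"
step mirrors the counter.set(counter.get() + incrementedCounter) call on line 356.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the int field stands in for the shared variable counter.
final class VarCounterSketch {
    private int counter;

    VarCounterSketch(int start) {
        this.counter = start;
    }

    /** Hands out howMany new variable ids for one round, then advances the counter. */
    List<Integer> allocateRound(int howMany) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= howMany; i++) {
            ids.add(counter + i);
        }
        // Same shape as counter.set(counter.get() + incrementedCounter).
        counter += howMany;
        return ids;
    }

    public static void main(String[] args) {
        VarCounterSketch vars = new VarCounterSketch(10);
        System.out.println(vars.allocateRound(3)); // [11, 12, 13]
        System.out.println(vars.allocateRound(2)); // [14, 15]
    }
}
```

Without the increment, the second round would hand out ids 11 and 12 again.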


Line 329:         // Step3.3. the suffix 0-3 is used for identifying the different level of
variable references.
> Can you present an example? different levels?
Nothing special: it is only used to generate the vars for the anchor "#LEFT_1" in line 90 of
the AQL template.


Line 356:         counter.set(counter.get() + incrementedCounter);
> How is this counter used?
See the comment on line 317.


Line 407:     // of expRef, we need to add the full condition expRef\getItemExprRef into the
top-level operator of the plan.
> Can you present an example here?
use dataverse fuzzytest;

for $d in dataset DBLP 
for $t in dataset CSX 
for $r in dataset ACM
where word-tokens($d.title) ~= word-tokens($t.title) 
and $d.year < $t.year
and word-tokens($t.authors) ~= word-tokens($r.authors)
and $t.year < $r.year
return {"did": $d.tid, "tid": $t.tid, "rid": $r.tid}

Here, $t.year < $r.year will be pushed onto the new topJoinOp of the second fuzzy join.

In general, this method moves the extra conditions (everything other than the fuzzy join
itself) onto the new topJoinOp of the substituted plan.
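
As a simplified illustration of "full condition minus the fuzzy join item", the sketch below
treats a condition as a list of conjunct strings. ExtraConditionSketch and withoutFuzzyItem are
hypothetical names; the real rule works on expression references, not strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; conjuncts are plain strings instead of expression references.
final class ExtraConditionSketch {
    /** Returns the conjuncts that remain once the fuzzy join item is taken out. */
    static List<String> withoutFuzzyItem(List<String> conjuncts, String fuzzyItem) {
        List<String> rest = new ArrayList<>(conjuncts);
        rest.remove(fuzzyItem); // removes the first matching conjunct only
        return rest;
    }

    public static void main(String[] args) {
        String fuzzy = "word-tokens($t.authors) ~= word-tokens($r.authors)";
        List<String> condition = Arrays.asList(fuzzy, "$t.year < $r.year");
        // What is left is the part that goes onto the new top-level join.
        System.out.println(withoutFuzzyItem(condition, fuzzy)); // [$t.year < $r.year]
    }
}
```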


Line 426:         topJoin.getCondition().setValue(andFunc);
> Why is this required for left-outer-join?
I think applying a Select directly above the left outer join is not equivalent to inlining
the extra condition in the join condition, right?
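
A small self-contained sketch of why the two placements differ. It uses a toy nested-loop left
outer join over integers; the class and method names are hypothetical and nothing here is taken
from the AsterixDB operators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch of a left outer join over integer "rows".
final class LeftOuterJoinSketch {
    /** Nested-loop left outer join; unmatched left rows are padded with null. */
    static List<List<Integer>> leftOuterJoin(List<Integer> left, List<Integer> right,
            BiPredicate<Integer, Integer> cond) {
        List<List<Integer>> out = new ArrayList<>();
        for (Integer l : left) {
            boolean matched = false;
            for (Integer r : right) {
                if (cond.test(l, r)) {
                    out.add(Arrays.asList(l, r));
                    matched = true;
                }
            }
            if (!matched) {
                out.add(Arrays.asList(l, null));
            }
        }
        return out;
    }

    /** Select above the join: keeps rows whose right side is non-null and > 2. */
    static List<List<Integer>> selectRightGreaterThanTwo(List<List<Integer>> rows) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> row : rows) {
            Integer r = row.get(1);
            if (r != null && r > 2) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 2, 3);
        List<Integer> right = Arrays.asList(2, 3);
        // Extra condition inlined in the join condition: every left row survives.
        System.out.println(leftOuterJoin(left, right, (l, r) -> l.equals(r) && r > 2));
        // prints [[1, null], [2, null], [3, 3]]
        // Extra condition as a Select above the join: null-padded rows are dropped.
        System.out.println(selectRightGreaterThanTwo(leftOuterJoin(left, right, (l, r) -> l.equals(r))));
        // prints [[3, 3]]
    }
}
```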


-- 
To view, visit https://asterix-gerrit.ics.uci.edu/1076
To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8736f104905eeda763d39709e002c2b9629278cc
Gerrit-PatchSet: 21
Gerrit-Project: asterixdb
Gerrit-Branch: master
Gerrit-Owner: Wenhai Li <lwhaymail@yahoo.com>
Gerrit-Reviewer: Chen Li <chenli@gmail.com>
Gerrit-Reviewer: Jenkins <jenkins@fulliautomatix.ics.uci.edu>
Gerrit-Reviewer: Taewoo Kim <wangsaeu@yahoo.com>
Gerrit-Reviewer: Wenhai Li <lwhaymail@yahoo.com>
Gerrit-HasComments: Yes
