asterixdb-notifications mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ASTERIXDB-1487) Fuzzy select-join on inverted index poses inconsistent results.
Date Fri, 23 Sep 2016 15:46:21 GMT

    [ https://issues.apache.org/jira/browse/ASTERIXDB-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15516800#comment-15516800 ]

ASF subversion and git services commented on ASTERIXDB-1487:
------------------------------------------------------------

Commit 9182b6d18bc44a9fdee1c27a00ef7a66356bfb6c in asterixdb's branch refs/heads/master from Michael
[ https://git-wip-us.apache.org/repos/asf?p=asterixdb.git;h=9182b6d ]

ASTERIXDB-1487: fix the wrong plan when we prune the selective branch.

1. Add the test case of ASTERIX-1487 with single join branch required.
2. Disable the join branch pruning in case of unnestmap following datasourcescan.
   - We need to prune the join branch when it is NOT required by the upstream operators and its generated join key is derived from the same DATASOURCE of the other branch.
   - We SHOULD NOT prune the join branch if there exists a selective operator (UNNESTMAP, LOUNNESTMAP, LIMIT, SELECT) located between the join operator and DATASOURCESCAN.
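
The two pruning conditions above can be sketched as a small predicate. This is only an illustration under stated assumptions, not the optimizer rule from the commit: the `Op` class, the `may_prune` function, and the idea of modelling a join branch as a single chain of operators ending in a scan are all hypothetical, and "join key derived from the same DATASOURCE" is simplified to "the branch's scan reads the same dataset as the other branch".

```python
from dataclasses import dataclass
from typing import Optional

# Operator kinds the commit message lists as "selective".
SELECTIVE_OPS = {"UNNESTMAP", "LOUNNESTMAP", "LIMIT", "SELECT"}


@dataclass
class Op:
    """Hypothetical plan node: an operator kind with at most one input."""
    kind: str
    source: Optional[str] = None   # dataset name, set only on DATASOURCESCAN
    child: Optional["Op"] = None


def chain(branch):
    """Yield the operators of a branch from its root down to its leaf."""
    node = branch
    while node is not None:
        yield node
        node = node.child


def may_prune(branch, other_source, required_upstream):
    """Return True if the join branch may be pruned under the two rules.

    Assumption: the join key is treated as 'derived from the same
    datasource' when the branch's leaf scan reads `other_source`.
    """
    ops = list(chain(branch))
    leaf = ops[-1]
    if required_upstream:
        return False  # upstream operators still need this branch
    if leaf.kind != "DATASOURCESCAN" or leaf.source != other_source:
        return False  # key does not come from the other branch's dataset
    # Any selective operator between the join and the scan blocks pruning,
    # because it changes which records the branch produces.
    return not any(op.kind in SELECTIVE_OPS for op in ops[:-1])


# A branch with no filtering, and one with a SELECT above the scan.
plain = Op("ASSIGN", child=Op("DATASOURCESCAN", source="DBLP"))
filtered = Op("SELECT", child=Op("DATASOURCESCAN", source="DBLP"))
```

In this sketch `may_prune(plain, "DBLP", False)` allows pruning, while `may_prune(filtered, "DBLP", False)` does not, which mirrors the case the commit disables.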

Change-Id: I1aef69a2278853fd9f8020da6639331b367ed5ad
Reviewed-on: https://asterix-gerrit.ics.uci.edu/1119
Tested-by: Jenkins <jenkins@fulliautomatix.ics.uci.edu>
Integration-Tests: Jenkins <jenkins@fulliautomatix.ics.uci.edu>
Reviewed-by: Yingyi Bu <buyingyi@gmail.com>


> Fuzzy select-join on inverted index poses inconsistent results.
> ---------------------------------------------------------------
>
>                 Key: ASTERIXDB-1487
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1487
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: AsterixDB
>         Environment: MAC 4 cores, 8GB memory. The current master till 3/17/2016.
>            Reporter: Wenhai
>            Assignee: Wenhai
>            Priority: Critical
>         Attachments: csx-small-multi-id.txt, dblp-small-multi-id.txt
>
>
> As shown below, after we switch the two "for" branches of the fuzzy join over a select, the results are inconsistent.
> Schema
> {noformat}
> drop dataverse test if exists;
> create dataverse test;
> use dataverse test;
> create type DBLPNestedType as closed {
>   id: int64,
>   dblpid: string,
>   title: string,
>   authors: string,
>   misc: string
> }
> create type DBLPType as closed {
>   nested: DBLPNestedType
> }
> create type CSXNestedType as closed {
>   id: int64,
>   csxid: string,
>   title: string,
>   authors: string,
>   misc: string
> }
> create type CSXType as closed {
>   nested: CSXNestedType
> }
> create dataset DBLPtmp(DBLPNestedType) primary key id;
> create dataset CSXtmp(CSXNestedType) primary key id;
> create dataset DBLP(DBLPType) primary key nested.id;
> create dataset CSX(CSXType) primary key nested.id;
> use dataverse test;
> load dataset DBLPtmp
> using localfs
> (("path"="asterix_nc1://data/dblp-small/dblp-small-multi-id.txt"),("format"="delimited-text"),("delimiter"=":"),("quote"="\u0000")) pre-sorted;
> load dataset CSXtmp
> using localfs
> (("path"="asterix_nc1://data/pub-small/csx-small-multi-id.txt"),("format"="delimited-text"),("delimiter"=":"),("quote"="\u0000"));
> insert into dataset DBLP(
>         for $x in dataset DBLPtmp
>         return {
>                 "nested": $x
>         }
> );
> insert into dataset CSX(
>         for $x in dataset CSXtmp
>         return {
>                 "nested": $x
>         }
> );
> {noformat}
> Indexes
> {noformat}
> create index keyword_index on DBLP(nested.title) type keyword; 
> create index keyword_indexdbauhors on DBLP(nested.authors) type keyword;
> create index keyword_indexcsxauthors on CSX(nested.authors) type keyword;
> {noformat}
> The following query
> {noformat}
> use dataverse test;
> set simthresholds '.1'
> let $s := count(
> for $o in dataset DBLP
> for $t in dataset CSX
> where contains($o.nested.title, "System") and word-tokens($o.nested.authors) ~= word-tokens($t.nested.authors)
> return $o
> )
> return $s
> {noformat}
> will return 28, while the query
> {noformat}
> use dataverse test;
> set simthresholds '.1'
> let $s := count(
> for $t in dataset CSX
> for $o in dataset DBLP
> where contains($o.nested.title, "System") and word-tokens($o.nested.authors) ~= word-tokens($t.nested.authors)
> return $o
> )
> return $s
> {noformat}
> will return 3, or raise an error on a larger dataset.
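
For reference, the two queries differ only in the order of their "for" clauses, so a correct plan should give the same count for both. The following is a minimal sketch of that expectation in plain Python, not AsterixDB code. It assumes that `~=` compares the token sets with Jaccard similarity against the configured threshold and that `contains` is a case-sensitive substring test; the helper names and the toy records are hypothetical and are not taken from the attached data files.

```python
def word_tokens(text):
    """Split on whitespace and lower-case (a stand-in for word-tokens)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def matches(o, t, threshold):
    """The shared 'where' condition of both queries."""
    return ("System" in o["title"]
            and jaccard(word_tokens(o["authors"]),
                        word_tokens(t["authors"])) >= threshold)


def count_o_then_t(dblp, csx, threshold):
    """First query: iterate DBLP in the outer loop."""
    return sum(1 for o in dblp for t in csx if matches(o, t, threshold))


def count_t_then_o(dblp, csx, threshold):
    """Second query: iterate CSX in the outer loop."""
    return sum(1 for t in csx for o in dblp if matches(o, t, threshold))


# Hypothetical toy records standing in for DBLP and CSX.
dblp = [
    {"title": "Database System Concepts", "authors": "a b"},
    {"title": "Graph Theory", "authors": "a b"},
    {"title": "Operating System Design", "authors": "c d"},
]
csx = [{"authors": "a b c"}, {"authors": "c d"}, {"authors": "x y"}]

same = count_o_then_t(dblp, csx, 0.1) == count_t_then_o(dblp, csx, 0.1)
```

Because the condition is evaluated per pair, loop order cannot change the count in this model; any difference between the two query results therefore points at the plan rather than at the query semantics.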



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
