atlas-dev mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ATLAS-2891) Incorrect column lineage: each output column has input from *all columns* of the input table
Date Fri, 12 Oct 2018 21:35:00 GMT

    [ https://issues.apache.org/jira/browse/ATLAS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648474#comment-16648474 ]

ASF subversion and git services commented on ATLAS-2891:
--------------------------------------------------------

Commit 82e04037220a82ab985928221bc72ade80fc3be2 in atlas's branch refs/heads/master from [~madhan@apache.org]
[ https://git-wip-us.apache.org/repos/asf?p=atlas.git;h=82e0403 ]

ATLAS-2891: updated hook notification processing with option to ignore potentially incorrect
hive_column_lineage - #3


> Incorrect column lineage: each output column has input from *all columns* of the input table
> --------------------------------------------------------------------------------------------
>
>                 Key: ATLAS-2891
>                 URL: https://issues.apache.org/jira/browse/ATLAS-2891
>             Project: Atlas
>          Issue Type: Bug
>          Components: atlas-intg
>    Affects Versions: 0.8.2
>            Reporter: Madhan Neethiraj
>            Assignee: Madhan Neethiraj
>            Priority: Critical
>             Fix For: 0.8.3
>
>         Attachments: ATLAS-2891-branch-0.8.patch, ATLAS-2891.png
>
>
> Column lineage generated by the Atlas Hive hook is incorrect for certain queries, such as the following INSERT:
> {noformat}
> CREATE TABLE source_tbl(col_001 INT, col_002 INT, col_003 INT);
> CREATE TABLE target_tbl(col_001 INT, col_002 INT, col_003 INT);
> INSERT INTO target_tbl
>   SELECT v1.col_001, v1.col_002, v1.col_003
>   FROM (SELECT col_001, col_002, col_003, ROW_NUMBER() OVER() AS r_num FROM source_tbl) v1;
> {noformat}
> In this case, lineage for each column in target_tbl shows input from all columns in source_tbl.
> The lineage information provided to post hooks (such as the Atlas hook) contains 3 entries, one for each column in target_tbl.
> Note that the dependency for each column lists all columns of source_tbl:
> {noformat}
> DependencyKey=default.target_tbl:FieldSchema(name:col_001, type:int, comment:null)
> Dependency=[SCRIPT]
>            [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
>             default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
>             default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
>             default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
>             default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
>             default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
>            ];
>  
> DependencyKey=default.target_tbl:FieldSchema(name:col_002, type:int, comment:null)
> Dependency=[SCRIPT]
>            [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
>             default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
>             default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
>             default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
>             default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
>             default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
>            ];
>  
> DependencyKey=default.target_tbl:FieldSchema(name:col_003, type:int, comment:null)
> Dependency=[SCRIPT]
>            [default.source_tbl(src):FieldSchema(name:col_001, type:int, comment:null),
>             default.source_tbl(src):FieldSchema(name:col_002, type:int, comment:null),
>             default.source_tbl(src):FieldSchema(name:col_003, type:int, comment:null),
>             default.source_tbl(src):FieldSchema(name:BLOCK__OFFSET__INSIDE__FILE, type:bigint, comment:),
>             default.source_tbl(src):FieldSchema(name:INPUT__FILE__NAME, type:string, comment:),
>             default.source_tbl(src):FieldSchema(name:ROW__ID, type:struct<transactionId:bigint,bucketId:int,rowId:bigint>, comment:)
>            ];
> {noformat}
> When the INSERT statement does not include "ROW_NUMBER() OVER() AS r_num", the lineage details look correct.
> This issue is seen in Hive 1, but not in Hive 2 or Hive 3.
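
The commit message calls the affected hive_column_lineage only "potentially incorrect", which suggests that any filter has to rely on a heuristic. The following is a minimal Python sketch of one such heuristic, and it is not the code from the Atlas commit: it treats a lineage report as suspect when there are several output columns and every one of them lists the identical set of input columns, as in the dependency dump above. The function names, the `min_columns` parameter, and the dictionary layout are hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet

# Hypothetical layout: output column name -> set of input column names.
Lineage = Dict[str, FrozenSet[str]]


def is_suspect_lineage(lineage: Lineage, min_columns: int = 2) -> bool:
    """Return True if every output column reports the same multi-column input set."""
    if len(lineage) < min_columns:
        return False  # a single output column cannot show the pattern
    distinct_input_sets = set(lineage.values())
    if len(distinct_input_sets) != 1:
        return False  # outputs differ in their inputs, so nothing looks off
    (shared_inputs,) = distinct_input_sets
    return len(shared_inputs) >= min_columns


def drop_suspect_lineage(lineage: Lineage) -> Lineage:
    """Return an empty lineage when the report looks suspect, else return it unchanged."""
    return {} if is_suspect_lineage(lineage) else lineage


# Input set taken from the dependency dump in the issue description.
ALL_INPUTS = frozenset({
    "default.source_tbl.col_001",
    "default.source_tbl.col_002",
    "default.source_tbl.col_003",
    "default.source_tbl.BLOCK__OFFSET__INSIDE__FILE",
    "default.source_tbl.INPUT__FILE__NAME",
    "default.source_tbl.ROW__ID",
})

# What the issue reports: each target column depends on all source columns.
reported = {f"default.target_tbl.col_00{i}": ALL_INPUTS for i in (1, 2, 3)}

# A one-to-one mapping, which is what the INSERT's column list implies.
expected = {
    f"default.target_tbl.col_00{i}": frozenset({f"default.source_tbl.col_00{i}"})
    for i in (1, 2, 3)
}

print(is_suspect_lineage(reported))  # True
print(is_suspect_lineage(expected))  # False
```

A check like this can also flag legitimate lineage, for example a query in which every output column really is computed from all input columns, so it would only make sense as an opt-in setting.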



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
