hive-issues mailing list archives

From "Marta Kuczora (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-16784) Missing lineage information when hive.blobstore.optimizations.enabled is true
Date Mon, 29 May 2017 12:03:04 GMT

    [ https://issues.apache.org/jira/browse/HIVE-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16028257#comment-16028257 ]

Marta Kuczora edited comment on HIVE-16784 at 5/29/17 12:02 PM:
----------------------------------------------------------------

In the LineageState.setLineage method we get the file sink operator for the path:
{noformat}
  public void setLineage(Path dir, DataContainer dc,
      List<FieldSchema> cols) {
    // First lookup the file sink operator from the load work.
    Operator<?> op = dirToFop.get(dir);

    // Go over the associated fields and look up the dependencies
    // by position in the row schema of the filesink operator.
    if (op == null) {
      return;
    }

    List<ColumnInfo> signature = op.getSchema().getSignature();
    int i = 0;
    for (FieldSchema fs : cols) {
      linfo.putDependency(dc, fs, index.getDependency(op, signature.get(i++)));
    }
  }
{noformat}
The lineage information is missing from the out file because the dirToFop map
doesn't contain the given path, so op is null and setLineage returns early.
This map is populated in the SemanticAnalyzer.genFileSinkPlan method:
{noformat}
    if (ltd != null && SessionState.get() != null) {
      SessionState.get().getLineageState()
          .mapDirToFop(ltd.getSourcePath(), output);
    }
{noformat}
The path used here doesn't match the path used in the LineageState.setLineage method.
The difference is in the file name: the map contains the path for the file "-ext-10000", but
the path passed to LineageState.setLineage points to the "-ext-10002" file.
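
The mismatch can be reproduced in isolation. The following is only a minimal sketch, not the Hive code: plain strings stand in for Path and for the file sink operator, the class name LineageLookupSketch and the S3 bucket and staging directory names are made up for illustration, and only the "-ext-10000" / "-ext-10002" suffixes come from the observation above.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for LineageState (assumption: Strings replace Path and Operator).
final class LineageLookupSketch {
    // Maps an output directory to the file sink operator that writes to it.
    private final Map<String, String> dirToFop = new HashMap<>();

    // Plays the role of LineageState.mapDirToFop, called while the plan is generated.
    void mapDirToFop(String dir, String fileSinkOp) {
        dirToFop.put(dir, fileSinkOp);
    }

    // Plays the role of the lookup at the top of LineageState.setLineage.
    // Returns null when the directory was never registered.
    String findOperator(String dir) {
        return dirToFop.get(dir);
    }

    public static void main(String[] args) {
        // Hypothetical staging location; only the -ext-1000x suffixes are from the report.
        String staging = "s3a://example-bucket/add_part_test/.hive-staging";

        LineageLookupSketch state = new LineageLookupSketch();
        // Registration uses the -ext-10000 path.
        state.mapDirToFop(staging + "/-ext-10000", "FS_1");

        // The later lookup uses the -ext-10002 path, so no operator is found.
        String op = state.findOperator(staging + "/-ext-10002");
        if (op == null) {
            System.out.println("no operator for path, lineage skipped");
        } else {
            System.out.println("operator found: " + op);
        }
    }
}
```

In this sketch the lookup with the "-ext-10002" key returns null, which corresponds to the early return in setLineage; a lookup with the "-ext-10000" key would find the registered operator.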


was (Author: kuczoram):
In the LineageState.setLineage method we get the file sink operator for the path:
{noformat}
  public void setLineage(Path dir, DataContainer dc,
      List<FieldSchema> cols) {
    // First lookup the file sink operator from the load work.
    Operator<?> op = dirToFop.get(dir);

    // Go over the associated fields and look up the dependencies
    // by position in the row schema of the filesink operator.
    if (op == null) {
      return;
    }

    List<ColumnInfo> signature = op.getSchema().getSignature();
    int i = 0;
    for (FieldSchema fs : cols) {
      linfo.putDependency(dc, fs, index.getDependency(op, signature.get(i++)));
    }
  }
{noformat}
The reason why the lineage information is missing from the out file is that the dirToFop map
doesn't contain the given path.
This map is created in the SemanticAnalyzer.genFileSinkPlan method:
{noformat}
    if (ltd != null && SessionState.get() != null) {
      SessionState.get().getLineageState()
          .mapDirToFop(ltd.getSourcePath(), (FileSinkOperator) output);
    }
{noformat}
The path used here doesn't match with the path used in the LineageState.setLineage method.
The difference is in the file name, the map contains the path for the file "-ext-10000", but
the path in the LineageState points to the "-ext-10002" file.

> Missing lineage information when hive.blobstore.optimizations.enabled is true
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-16784
>                 URL: https://issues.apache.org/jira/browse/HIVE-16784
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Marta Kuczora
>
> Running the commands of the add_part_multiple.q test on S3 with hive.blobstore.optimizations.enabled=true
> fails because of missing lineage information.
> Running the command on HDFS
> {noformat}
> from src TABLESAMPLE (1 ROWS)
> insert into table add_part_test PARTITION (ds='2010-01-01') select 100,100
> insert into table add_part_test PARTITION (ds='2010-02-01') select 200,200
> insert into table add_part_test PARTITION (ds='2010-03-01') select 400,300
> insert into table add_part_test PARTITION (ds='2010-04-01') select 500,400;
> {noformat}
> results in the following posthook outputs:
> {noformat}
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-01-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-01-01).value EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-02-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-02-01).value EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-03-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-03-01).value EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-04-01).key EXPRESSION []
> POSTHOOK: Lineage: add_part_test2 PARTITION(ds=2010-04-01).value EXPRESSION []
> {noformat}
> These lines are not printed when running the command on the table located in S3.
> If hive.blobstore.optimizations.enabled=false, the lineage information is printed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
