hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From László Bodor (Jira) <j...@apache.org>
Subject [jira] [Updated] (HIVE-22579) ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)
Date Tue, 10 Dec 2019 20:27:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

László Bodor updated HIVE-22579:
--------------------------------
    Fix Version/s: 2.3.7

> ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-22579
>                 URL: https://issues.apache.org/jira/browse/HIVE-22579
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 2.3.7
>
>         Attachments: HIVE-22579.01.branch-2.patch, HIVE-22579.01.branch-2.patch
>
>
> There is a scenario when different SplitGenerator instances try to cover the delta-only
buckets (having no base file) more than once, so there could be multiple OrcSplit instances
generated for the same delta file, causing more tasks to read the same delta file more than
once, causing duplicate records in a simple select star query.
> File structure for a 256 bucket table
> {code}
> drwxrwxrwx   - hive hadoop          0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013
> -rw-r--r--   3 hive hadoop        353 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013/bucket_00012
> -rw-r--r--   3 hive hadoop       1642 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000/bucket_00012
> -rwxrwxrwx   3 hive hadoop       1635 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000/bucket_00012
> -rwxrwxrwx   3 hive hadoop       1808 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00043
> -rwxrwxrwx   3 hive hadoop       1633 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171
> {code}
> in this case, when bucket_00171 file has a record, and there is no base file for that,
a select (*) with ETL split strategy can generate 2 splits for the same delta bucket...
> the scenario of the issue:
> 1. ETLSplitStrategy contains a [covered[] array|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L763]
which is [shared between the SplitInfo instances|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L824]
to be created
> 2. a SplitInfo instance is created for [every base file (2 in this case)|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L809]
> 3. for every SplitInfo, [a SplitGenerator is created|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L925-L926],
and in the constructor, [parent's getSplit is called|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1251],
which tries to take care of the deltas
> I'm not sure at the moment what's the intention of this, but this way, duplicated delta
split can be generated, which can cause duplicated read later (note that both tasks read the
same delta file: bucket_00171)
> {code}
> 2019-12-01T16:24:53,669  INFO [TezTR-127843_16_30_0_171_0 (1575040127843_0016_30_00_000171_0)]
orc.ReaderImpl: Reading ORC rows from hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171
with {include: [true, true, true, true, true, true, true, true, true, true, true, true], offset:
0, length: 9223372036854775807, schema: struct<idp_warehouse_id:bigint,idp_audit_id:bigint,batch_id:decimal(9,0),source_system_cd:varchar(500),insert_time:timestamp,process_status_cd:varchar(20),business_date:date,last_update_time:timestamp,report_date:date,etl_run_time:timestamp,etl_run_nbr:bigint>}
> 2019-12-01T16:24:53,672  INFO [TezTR-127843_16_30_0_171_0 (1575040127843_0016_30_00_000171_0)]
lib.MRReaderMapred: Processing split: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:OrcSplit
[hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1,
start=171, length=0, isOriginal=false, fileLength=9223372036854775807, hasFooter=false, hasBase=false,
deltas=[{ minTxnId: 14 maxTxnId: 14 stmtIds: [0] }, { minTxnId: 15 maxTxnId: 15 stmtIds: [0]
}, { minTxnId: 16 maxTxnId: 16 stmtIds: [0] }]]
> 2019-12-01T16:24:55,807  INFO [TezTR-127843_16_30_0_425_0 (1575040127843_0016_30_00_000425_0)]
orc.ReaderImpl: Reading ORC rows from hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171
with {include: [true, true, true, true, true, true, true, true, true, true, true, true], offset:
0, length: 9223372036854775807, schema: struct<idp_warehouse_id:bigint,idp_audit_id:bigint,batch_id:decimal(9,0),source_system_cd:varchar(500),insert_time:timestamp,process_status_cd:varchar(20),business_date:date,last_update_time:timestamp,report_date:date,etl_run_time:timestamp,etl_run_nbr:bigint>}
> 2019-12-01T16:24:55,813  INFO [TezTR-127843_16_30_0_425_0 (1575040127843_0016_30_00_000425_0)]
lib.MRReaderMapred: Processing split: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:OrcSplit
[hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1,
start=171, length=0, isOriginal=false, fileLength=9223372036854775807, hasFooter=false, hasBase=false,
deltas=[{ minTxnId: 14 maxTxnId: 14 stmtIds: [0] }, { minTxnId: 15 maxTxnId: 15 stmtIds: [0]
}, { minTxnId: 16 maxTxnId: 16 stmtIds: [0] }]]
> {code}
> seems like this issue doesn't affect AcidV2, as getSplits() returns an empty collection
or throws an exception in case of unexpected deltas (which was the case here, where deltas
was not unexpected):
> https://github.com/apache/hive/blob/8ee3497f87f81fa84ee1023e891dc54087c2cd5e/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1178-L1197



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Mime
View raw message