hive-issues mailing list archives

From László Bodor (Jira) <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-24111) TestMmCompactorOnTez hangs when running against Tez 0.10.0 staging artifact
Date Fri, 04 Sep 2020 16:40:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-24111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190804#comment-17190804 ]

László Bodor edited comment on HIVE-24111 at 9/4/20, 4:39 PM:
--------------------------------------------------------------

For reference:
logs for a good run:  [^org.apache.hadoop.hive.ql.txn.compactor.TestMmCompactorOnTez-output.txt.log]

logs for a hanging run:  [^TestCrudCompactorTez.log] 

What is strange at first sight: I cannot see the MergeManager-related log messages where they would be expected, so this could be a shuffle issue.

good run:
{code}
2020-09-03T15:13:19,604  INFO [I/O Setup 0 Start: {Map 1}] orderedgrouped.Shuffle: Map_1: Shuffle assigned with 1 inputs, codec: None, ifileReadAhead: true
2020-09-03T15:13:19,605  INFO [I/O Setup 0 Start: {Map 1}] orderedgrouped.MergeManager: Map 1: MergerManager: memoryLimit=1278984847, maxSingleShuffleLimit=319746208, mergeThreshold=844130048, ioSortFactor=10, postMergeMem=0, memToMemMergeOutputsThreshold=10
2020-09-03T15:13:19,605  INFO [I/O Setup 0 Start: {Map 1}] orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: Map 1 with configuration: maxFetchFailuresBeforeReporting=5, reportReadErrorImmediately=true, maxFailedUniqueFetches=1, abortFailureLimit=15, maxTaskOutputAtOnce=20, numFetchers=1, hostFailureFraction=0.2, minFailurePerHost=4, maxAllowedFailedFetchFraction=0.5, maxStallTimeFraction=0.5, minReqProgressFraction=0.5, checkFailedFetchSinceLastCompletion=true
2020-09-03T15:13:19,606  INFO [I/O Setup 0 Start: {Map 1}] runtime.LogicalIOProcessorRuntimeTask: Started Input with src edge: Map 1
2020-09-03T15:13:19,606  INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: AutoStartComplete
2020-09-03T15:13:19,606  INFO [ShuffleAndMergeRunner {Map_1}] orderedgrouped.MergeManager: Setting merger's parent thread to ShuffleAndMergeRunner {Map_1}
2020-09-03T15:13:19,606  INFO [TezChild] task.TaskRunner2Callable: Running task, taskAttemptId=attempt_1599171197926_0001_1_01_000000_0
2020-09-03T15:13:19,607  INFO [TezTaskEventRouter{attempt_1599171197926_0001_1_01_000000_0}] orderedgrouped.ShuffleInputEventHandlerOrderedGrouped: Map 1: numDmeEventsSeen=1, numDmeEventsSeenWithNoData=0, numObsoletionEventsSeen=0
2020-09-03T15:13:19,607  INFO [TezChild] exec.SerializationUtilities: Deserializing ReduceWork using kryo
2020-09-03T15:13:19,607  INFO [TezChild] exec.Utilities: Deserialized plan (via RPC) - name: Reducer 2 size: 1.87KB
2020-09-03T15:13:19,607  INFO [TezChild] tez.ObjectCache: Caching key: lbodor_20200903151317_7f539b53-07fb-4bb1-97db-c37d72aba99d_Reducer 2__REDUCE_PLAN__
2020-09-03T15:13:19,607  INFO [TezChild] tez.RecordProcessor: conf class path = []
2020-09-03T15:13:19,608  INFO [TezChild] tez.RecordProcessor: thread class path = []
2020-09-03T15:13:19,608  INFO [Fetcher_O {Map_1} #0] orderedgrouped.MergeManager: close onDiskFile. State: NumOnDiskFiles=1. Current: path=/Users/lbodor/apache/hive/itests/hive-unit/target/tmp/scratchdir/lbodor/_tez_session_dir/e01fa9d5-36d9-4449-bfa4-d12b5fa290f8/.tez/application_1599171197926_0001_wd/localmode-local-dir/output/attempt_1599171197926_0001_1_00_000000_0_10098/file.out, len=26
2020-09-03T15:13:19,608  INFO [Fetcher_O {Map_1} #0] ShuffleScheduler.fetch: Completed fetch for attempt: {0, 0, attempt_1599171197926_0001_1_00_000000_0_10098} to DISK_DIRECT, csize=26, dsize=22, EndTime=1599171199608, TimeTaken=1, Rate=0.02 MB/s
2020-09-03T15:13:19,608  INFO [Fetcher_O {Map_1} #0] orderedgrouped.ShuffleScheduler: All inputs fetched for input vertex : Map 1
{code}

hanging run:
{code}
2020-09-04T02:12:16,392  INFO [I/O Setup 0 Start: {Map 1}] orderedgrouped.Shuffle: Map_1: Shuffle assigned with 1 inputs, codec: None, ifileReadAhead: true
2020-09-04T02:12:16,392  INFO [I/O Setup 0 Start: {Map 1}] orderedgrouped.MergeManager: Map 1: MergerManager: memoryLimit=1278984847, maxSingleShuffleLimit=319746208, mergeThreshold=844130048, ioSortFactor=10, postMergeMem=0, memToMemMergeOutputsThreshold=10
2020-09-04T02:12:16,394  INFO [I/O Setup 0 Start: {Map 1}] orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex: Map 1 with configuration: maxFetchFailuresBeforeReporting=5, reportReadErrorImmediately=true, maxFailedUniqueFetches=1, abortFailureLimit=15, maxTaskOutputAtOnce=20, numFetchers=1, hostFailureFraction=0.2, minFailurePerHost=4, maxAllowedFailedFetchFraction=0.5, maxStallTimeFraction=0.5, minReqProgressFraction=0.5, checkFailedFetchSinceLastCompletion=true
2020-09-04T02:12:16,398  INFO [I/O Setup 0 Start: {Map 1}] runtime.LogicalIOProcessorRuntimeTask: Started Input with src edge: Map 1
2020-09-04T02:12:16,398  INFO [TezChild] runtime.LogicalIOProcessorRuntimeTask: AutoStartComplete
2020-09-04T02:12:16,398  INFO [TezChild] task.TaskRunner2Callable: Running task, taskAttemptId=attempt_1599210735282_0001_1_01_000000_0
2020-09-04T02:12:16,416  INFO [TezChild] tez.ReduceRecordProcessor: Waiting for ShuffleInputs to become ready
{code}

So in the hanging run we reach the point of ShuffleScheduler instantiation, but we never reach this point:
{code}
2020-09-03T15:13:19,606  INFO [ShuffleAndMergeRunner {Map_1}] orderedgrouped.MergeManager: Setting merger's parent thread to ShuffleAndMergeRunner {Map_1}
{code}
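
For quick triage (just a sketch, assuming the attached logs are saved locally under their attachment names), grepping for the MergeManager parent-thread message separates the two cases: it appears in the good run and is absent from the hanging one.
{code}
# prints <file>:<count>; a non-zero count means the merger was started (good run),
# 0 means the run never got past ShuffleScheduler instantiation (hanging run)
grep -c "Setting merger's parent thread to ShuffleAndMergeRunner" \
  org.apache.hadoop.hive.ql.txn.compactor.TestMmCompactorOnTez-output.txt.log \
  TestCrudCompactorTez.log
{code}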



> TestMmCompactorOnTez hangs when running against Tez 0.10.0 staging artifact
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-24111
>                 URL: https://issues.apache.org/jira/browse/HIVE-24111
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TestCrudCompactorTez.log, jstack.log, org.apache.hadoop.hive.ql.txn.compactor.TestMmCompactorOnTez-output.txt.log
>
>
> Reproduced the issue in a ptest run that I configured to run against the Tez staging artifacts (https://repository.apache.org/content/repositories/orgapachetez-1068/)
> http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/PR-1311/14/pipeline/417
> I'm about to investigate this. I think Tez 0.10.0 cannot be released until we confirm whether it's a Hive or a Tez bug.
> {code}
> mvn test -Pitests,hadoop-2 -Dtest=TestMmCompactorOnTez -pl ./itests/hive-unit
> {code}
> tez setup:
> https://github.com/apache/hive/commit/92516631ab39f39df5d0692f98ac32c2cd320997#diff-a22bcc9ba13b310c7abfee4a57c4b130R83-R97



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
