tez-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "TezQA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors
Date Mon, 20 Aug 2018 19:35:00 GMT

    [ https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586399#comment-16586399
] 

TezQA commented on TEZ-3976:
----------------------------

{color:green}+1 overall{color}.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12936304/TEZ-3976.2.patch
  against master revision 4b9a7be.

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new or modified
test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of
javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version
3.0.1) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number
of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2895//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2895//console

This message is automatically generated.


> ShuffleManager reporting too many errors
> ----------------------------------------
>
>                 Key: TEZ-3976
>                 URL: https://issues.apache.org/jira/browse/TEZ-3976
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jaume M
>            Assignee: Jaume M
>            Priority: Major
>         Attachments: TEZ-3976.1.patch, TEZ-3976.2.patch
>
>
> The symptoms are a lot of these logs are being shown:
> {code:java}
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #0 ()] org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager:
Reducer_5: Fetch failed for src: InputAttemptIdentifier [inputIdentifier=701, attemptNumber=0,
pathComponent=attempt_1529044441963_0021_34_01_000701_0_12541_0, spillType=2, spillId=0]InputIdentifier:
InputAttemptIdentifier [inputIdentifier=701, attemptNumber=0, pathComponent=attempt_1529044441963_0021_34_01_000701_0_12541_0,
spillType=2, spillId=0], connectFailed: true
> 2018-06-15T18:09:35,811 WARN  [Fetcher_B {Reducer_5} #1 ()] org.apache.tez.runtime.library.common.shuffle.Fetcher:
copyInputs failed for tasks [InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0,
pathComponent=attempt_1529044441963_0021_34_01_000589_0_12445_0, spillType=2, spillId=0]]
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #1 ()] org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager:
Reducer_5: Fetch failed for src: InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0,
pathComponent=attempt_1529044441963_0021_34_01_000589_0_12445_0, spillType=2, spillId=0]InputIdentifier:
InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0, pathComponent=attempt_1529044441963_0021_34_01_000589_0_12445_0,
spillType=2, spillId=0], connectFailed: true
> {code}
> Each of those translate into an event in the AM which finally crashes due to OOM after
around 30 minutes and around 10 million shuffle input errors (and 10 million lines like the
previous ones). When the ShufflerManager is closed and the counters reported there are many
shuffle input errors, some of those logs are:
> {code:java}
> 2018-06-15T17:46:30,988  INFO [TezTR-441963_21_34_4_0_4 (1529044441963_0021_34_04_000000_4)]
runtime.LogicalIOProcessorRuntimeTask: Final Counters for attempt_1529044441963_0021_34_04_000000_4:
Counters: 43 [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0, NUM_SHUFFLED_INPUTS=26,
NUM_FAILED_SHUFFLE_INPUTS=858965, INPUT_RECORDS_PROCESSED=26, OUTPUT_RECORDS=1, OUTPUT_LARGE_RECORDS=0,
OUTPUT_BYTES=779472, OUTPUT_BYTES_WITH_OVERHEAD=779483, OUTPUT_BYTES_PHYSICAL=780146, ADDITIONAL_SPILLS_BYTES_WRITTEN=0,
ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603,
SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_DISK_DIRECT=826947, SHUFFLE_PHASE_TIME=52516,
FIRST_EVENT_RECEIVED=1, LAST_EVENT_RECEIVED=1185][HIVE RECORDS_OUT_INTERMEDIATE_^[[1;35;40m^[[KReducer_12^[[m^[[K=1,
RECORDS_OUT_OPERATOR_GBY_159=1, RECORDS_OUT_OPERATOR_RS_160=1][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_INPUT_Map_11
FIRST_EVENT_RECEIVED=1, INPUT_RECORDS_PROCESSED=26, LAST_EVENT_RECEIVED=1185, NUM_FAILED_SHUFFLE_INPUTS=858965,
NUM_SHUFFLED_INPUTS=26, SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603, SHUFFLE_BYTES_DISK_DIRECT=826947,
SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_PHASE_TIME=52516][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_OUTPUT_Map_1
ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, ADDITIONAL_SPILL_COUNT=0,
OUTPUT_BYTES=779472, OUTPUT_BYTES_PHYSICAL=780146, OUTPUT_BYTES_WITH_OVERHEAD=779483, OUTPUT_LARGE_RECORDS=0,
OUTPUT_RECORDS=1, SPILLED_RECORDS=0]]
> 2018-06-15T17:46:32,271 INFO  [TezTR-441963_21_34_3_15_1 ()] org.apache.tez.runtime.LogicalIOProcessorRuntimeTask:
Final Counters for attempt_1529044441963_0021_34_03_000015_1: Counters: 87 [[File System Counters
FILE_BYTES_READ=0, FILE_BYTES_WRITTEN=0, FILE_READ_OPS=0, FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0,
HDFS_BYTES_READ=2344929, HDFS_BYTES_WRITTEN=0, HDFS_READ_OPS=5, HDFS_LARGE_READ_OPS=0, HDFS_WRITE_OPS=0][org.apache.tez.common.counters.TaskCounter
SPILLED_RECORDS=0, NUM_SHUFFLED_INPUTS=1, NUM_FAILED_SHUFFLE_INPUTS=105195, INPUT_RECORDS_PROCESSED=397,
INPUT_SPLIT_LENGTH_BYTES=21563271, OUTPUT_RECORDS=15737, OUTPUT_LARGE_RECORDS=0, OUTPUT_BYTES=1235818,
OUTPUT_BYTES_WITH_OVERHEAD=1267307, OUTPUT_BYTES_PHYSICAL=357520, ADDITIONAL_SPILLS_BYTES_WRITTEN=0,
ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, SHUFFLE_BYTES=31, SHUFFLE_BYTES_DECOMPRESSED=17,
SHUFFLE_BYTES_TO_MEM=31, SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_DISK_DIRECT=0, SHUFFLE_PHASE_TIME=50525,
FIRST_EVENT_RECEIVED=9, LAST_EVENT_RECEIVED=61][HIVE DESERIALIZE_ERRORS=0, RECORDS_IN_Map_11=395611,
RECORDS_OUT_INTERMEDIATE_Map_11=15737, RECORDS_OUT_OPERATOR_FIL_152=395611, RECORDS_OUT_OPERATOR_GBY_157=1,
RECORDS_OUT_OPERATOR_MAPJOIN_154=15736, RECORDS_OUT_OPERATOR_MAP_0=0, RECORDS_OUT_OPERATOR_RS_155=15736,
RECORDS_OUT_OPERATOR_RS_158=1, RECORDS_OUT_OPERATOR_SEL_153=395611, RECORDS_OUT_OPERATOR_SEL_156=15736,
RECORDS_OUT_OPERATOR_TS_26=395611][TaskCounter_Map_11_INPUT_Map_13 FIRST_EVENT_RECEIVED=9,
INPUT_RECORDS_PROCESSED=1, LAST_EVENT_RECEIVED=61, NUM_FAILED_SHUFFLE_INPUTS=105195, NUM_SHUFFLED_INPUTS=1,
SHUFFLE_BYTES=31, SHUFFLE_BYTES_DECOMPRESSED=17, SHUFFLE_BYTES_DISK_DIRECT=0, SHUFFLE_BYTES_TO_DISK=0,
SHUFFLE_BYTES_TO_MEM=31, SHUFFLE_PHASE_TIME=50525][TaskCounter_Map_11_INPUT_supplier INPUT_RECORDS_PROCESSED=396,
INPUT_SPLIT_LENGTH_BYTES=21563271][TaskCounter_Map_11_OUTPUT_Reducer_12 ADDITIONAL_SPILLS_BYTES_READ=0,
ADDITIONAL_SPILLS_BYTES_WRITTEN=0, ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779474, OUTPUT_BYTES_PHYSICAL=164787,
OUTPUT_BYTES_WITH_OVERHEAD=779485, OUTPUT_LARGE_RECORDS=0, OUTPUT_RECORDS=1, SPILLED_RECORDS=0][TaskCounter_Map_11_OUTPUT_Reducer_6
ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, ADDITIONAL_SPILL_COUNT=0,
OUTPUT_BYTES=456344, OUTPUT_BYTES_PHYSICAL=192733, OUTPUT_BYTES_WITH_OVERHEAD=487822, OUTPUT_LARGE_RECORDS=0,
OUTPUT_RECORDS=15736, SPILLED_RECORDS=0][org.apache.hadoop.hive.llap.counters.LlapIOCounters
ALLOCATED_BYTES=9633792, ALLOCATED_USED_BYTES=7976706, CACHE_HIT_BYTES=0, CACHE_MISS_BYTES=2344364,
CONSUMER_TIME_NS=1136392475, DECODE_TIME_NS=140377915, HDFS_TIME_NS=145825282, METADATA_CACHE_MISS=4,
NUM_DECODED_BATCHES=41, NUM_VECTOR_BATCHES=396, ROWS_EMITTED=395611, SELECTED_ROWGROUPS=41,
TOTAL_IO_TIME_NS=1234990631]]
> {code}
> I think this is happening because the fetcher is in [this loop|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L366]
which is started [here|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskRunner2Callable.java#L69]
and not stopped until the [processor is closed|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java#L387]
(which is called from [here|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskRunner2Callable.java#L83]
or [here|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskRunner2Callable.java#L110]).
> In {{ShuffleManager}} we seem to keep track of {{TaskCounter.NUM_FAILED_SHUFFLE_INPUTS}}
but nothing is done when the value gets too high. Maybe something similar to {{ShuffleScheduler}}
should be done where the fetchers are only retried a certain amount of times.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message