From issues-return-39553-archive-asf-public=cust-asf.ponee.io@tez.apache.org Tue Aug 7 02:19:05 2018 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx-eu-01.ponee.io (Postfix) with SMTP id C0E03180627 for ; Tue, 7 Aug 2018 02:19:04 +0200 (CEST) Received: (qmail 54754 invoked by uid 500); 7 Aug 2018 00:19:03 -0000 Mailing-List: contact issues-help@tez.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@tez.apache.org Delivered-To: mailing list issues@tez.apache.org Received: (qmail 54736 invoked by uid 99); 7 Aug 2018 00:19:03 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 07 Aug 2018 00:19:03 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 8C8CDC6A7A for ; Tue, 7 Aug 2018 00:19:03 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -110.051 X-Spam-Level: X-Spam-Status: No, score=-110.051 tagged_above=-999 required=6.31 tests=[ENV_AND_HDR_SPF_MATCH=-0.5, KAM_LOTSOFHASH=0.25, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001, USER_IN_DEF_SPF_WL=-7.5, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id cMwL1k04laLx for ; Tue, 7 Aug 2018 00:19:01 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id EF0365F3D0 for ; Tue, 7 Aug 2018 00:19:00 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 6B717E0026 for ; Tue, 7 Aug 2018 00:19:00 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 119A723F9C for ; Tue, 7 Aug 2018 00:19:00 +0000 (UTC) Date: Tue, 7 Aug 2018 00:19:00 +0000 (UTC) From: "Jaume M (JIRA)" To: issues@tez.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (TEZ-3976) ShuffleManager reporting too many errors MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/TEZ-3976?page=3Dcom.atlassian.= jira.plugin.system.issuetabpanels:all-tabpanel ] Jaume M updated TEZ-3976: ------------------------- Summary: ShuffleManager reporting too many errors (was: {{ShuffleManag= er}} reporting too many errors) > ShuffleManager reporting too many errors > ---------------------------------------- > > Key: TEZ-3976 > URL: https://issues.apache.org/jira/browse/TEZ-3976 > Project: Apache Tez > Issue Type: Bug > Reporter: Jaume M > Priority: Major > > The symptoms are a lot of these logs are being shown: > {code:java} > 2018-06-15T18:09:35,811 INFO [Fetcher_B {Reducer_5} #0 ()] org.apache.te= z.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: Fetch fail= ed for src: InputAttemptIdentifier [inputIdentifier=3D701, attemptNumber=3D= 0, pathComponent=3Dattempt_1529044441963_0021_34_01_000701_0_12541_0, spill= Type=3D2, spillId=3D0]InputIdentifier: InputAttemptIdentifier [inputIdentif= ier=3D701, attemptNumber=3D0, pathComponent=3Dattempt_1529044441963_0021_34= _01_000701_0_12541_0, spillType=3D2, spillId=3D0], connectFailed: true > 2018-06-15T18:09:35,811 WARN [Fetcher_B {Reducer_5} #1 ()] org.apache.te= z.runtime.library.common.shuffle.Fetcher: copyInputs failed for tasks [Inpu= tAttemptIdentifier [inputIdentifier=3D589, attemptNumber=3D0, pathComponent= =3Dattempt_1529044441963_0021_34_01_000589_0_12445_0, spillType=3D2, spillI= d=3D0]] > 2018-06-15T18:09:35,811 INFO [Fetcher_B {Reducer_5} #1 ()] org.apache.te= z.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: Fetch fail= ed for src: InputAttemptIdentifier [inputIdentifier=3D589, attemptNumber=3D= 0, pathComponent=3Dattempt_1529044441963_0021_34_01_000589_0_12445_0, spill= Type=3D2, spillId=3D0]InputIdentifier: InputAttemptIdentifier [inputIdentif= ier=3D589, attemptNumber=3D0, pathComponent=3Dattempt_1529044441963_0021_34= _01_000589_0_12445_0, spillType=3D2, spillId=3D0], connectFailed: true > {code} > Each of those translate into an event in the AM which finally crashes due= to OOM after around 30 minutes and around 10 million shuffle input errors = (and 10 million lines like the previous ones). When the ShufflerManager is = closed and the counters reported there are many shuffle input errors, some = of those logs are: > {code:java} > 2018-06-15T17:46:30,988 INFO [TezTR-441963_21_34_4_0_4 (1529044441963_00= 21_34_04_000000_4)] runtime.LogicalIOProcessorRuntimeTask: Final Counters f= or attempt_1529044441963_0021_34_04_000000_4: Counters: 43 [[org.apache.tez= .common.counters.TaskCounter SPILLED_RECORDS=3D0, NUM_SHUFFLED_INPUTS=3D26,= NUM_FAILED_SHUFFLE_INPUTS=3D858965, INPUT_RECORDS_PROCESSED=3D26, OUTPUT_R= ECORDS=3D1, OUTPUT_LARGE_RECORDS=3D0, OUTPUT_BYTES=3D779472, OUTPUT_BYTES_W= ITH_OVERHEAD=3D779483, OUTPUT_BYTES_PHYSICAL=3D780146, ADDITIONAL_SPILLS_BY= TES_WRITTEN=3D0, ADDITIONAL_SPILLS_BYTES_READ=3D0, ADDITIONAL_SPILL_COUNT= =3D0, SHUFFLE_BYTES=3D4207563, SHUFFLE_BYTES_DECOMPRESSED=3D20266603, SHUFF= LE_BYTES_TO_MEM=3D3380616, SHUFFLE_BYTES_TO_DISK=3D0, SHUFFLE_BYTES_DISK_DI= RECT=3D826947, SHUFFLE_PHASE_TIME=3D52516, FIRST_EVENT_RECEIVED=3D1, LAST_E= VENT_RECEIVED=3D1185][HIVE RECORDS_OUT_INTERMEDIATE_^[[1;35;40m^[[KReducer_= 12^[[m^[[K=3D1, RECORDS_OUT_OPERATOR_GBY_159=3D1, RECORDS_OUT_OPERATOR_RS_1= 60=3D1][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_INPUT_Map_11 FIRST_EV= ENT_RECEIVED=3D1, INPUT_RECORDS_PROCESSED=3D26, LAST_EVENT_RECEIVED=3D1185,= NUM_FAILED_SHUFFLE_INPUTS=3D858965, NUM_SHUFFLED_INPUTS=3D26, SHUFFLE_BYTE= S=3D4207563, SHUFFLE_BYTES_DECOMPRESSED=3D20266603, SHUFFLE_BYTES_DISK_DIRE= CT=3D826947, SHUFFLE_BYTES_TO_DISK=3D0, SHUFFLE_BYTES_TO_MEM=3D3380616, SHU= FFLE_PHASE_TIME=3D52516][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_OUTP= UT_Map_1 ADDITIONAL_SPILLS_BYTES_READ=3D0, ADDITIONAL_SPILLS_BYTES_WRITTEN= =3D0, ADDITIONAL_SPILL_COUNT=3D0, OUTPUT_BYTES=3D779472, OUTPUT_BYTES_PHYSI= CAL=3D780146, OUTPUT_BYTES_WITH_OVERHEAD=3D779483, OUTPUT_LARGE_RECORDS=3D0= , OUTPUT_RECORDS=3D1, SPILLED_RECORDS=3D0]] > 2018-06-15T17:46:32,271 INFO [TezTR-441963_21_34_3_15_1 ()] org.apache.t= ez.runtime.LogicalIOProcessorRuntimeTask: Final Counters for attempt_152904= 4441963_0021_34_03_000015_1: Counters: 87 [[File System Counters FILE_BYTES= _READ=3D0, FILE_BYTES_WRITTEN=3D0, FILE_READ_OPS=3D0, FILE_LARGE_READ_OPS= =3D0, FILE_WRITE_OPS=3D0, HDFS_BYTES_READ=3D2344929, HDFS_BYTES_WRITTEN=3D0= , HDFS_READ_OPS=3D5, HDFS_LARGE_READ_OPS=3D0, HDFS_WRITE_OPS=3D0][org.apach= e.tez.common.counters.TaskCounter SPILLED_RECORDS=3D0, NUM_SHUFFLED_INPUTS= =3D1, NUM_FAILED_SHUFFLE_INPUTS=3D105195, INPUT_RECORDS_PROCESSED=3D397, IN= PUT_SPLIT_LENGTH_BYTES=3D21563271, OUTPUT_RECORDS=3D15737, OUTPUT_LARGE_REC= ORDS=3D0, OUTPUT_BYTES=3D1235818, OUTPUT_BYTES_WITH_OVERHEAD=3D1267307, OUT= PUT_BYTES_PHYSICAL=3D357520, ADDITIONAL_SPILLS_BYTES_WRITTEN=3D0, ADDITIONA= L_SPILLS_BYTES_READ=3D0, ADDITIONAL_SPILL_COUNT=3D0, SHUFFLE_BYTES=3D31, SH= UFFLE_BYTES_DECOMPRESSED=3D17, SHUFFLE_BYTES_TO_MEM=3D31, SHUFFLE_BYTES_TO_= DISK=3D0, SHUFFLE_BYTES_DISK_DIRECT=3D0, SHUFFLE_PHASE_TIME=3D50525, FIRST_= EVENT_RECEIVED=3D9, LAST_EVENT_RECEIVED=3D61][HIVE DESERIALIZE_ERRORS=3D0, = RECORDS_IN_Map_11=3D395611, RECORDS_OUT_INTERMEDIATE_Map_11=3D15737, RECORD= S_OUT_OPERATOR_FIL_152=3D395611, RECORDS_OUT_OPERATOR_GBY_157=3D1, RECORDS_= OUT_OPERATOR_MAPJOIN_154=3D15736, RECORDS_OUT_OPERATOR_MAP_0=3D0, RECORDS_O= UT_OPERATOR_RS_155=3D15736, RECORDS_OUT_OPERATOR_RS_158=3D1, RECORDS_OUT_OP= ERATOR_SEL_153=3D395611, RECORDS_OUT_OPERATOR_SEL_156=3D15736, RECORDS_OUT_= OPERATOR_TS_26=3D395611][TaskCounter_Map_11_INPUT_Map_13 FIRST_EVENT_RECEIV= ED=3D9, INPUT_RECORDS_PROCESSED=3D1, LAST_EVENT_RECEIVED=3D61, NUM_FAILED_S= HUFFLE_INPUTS=3D105195, NUM_SHUFFLED_INPUTS=3D1, SHUFFLE_BYTES=3D31, SHUFFL= E_BYTES_DECOMPRESSED=3D17, SHUFFLE_BYTES_DISK_DIRECT=3D0, SHUFFLE_BYTES_TO_= DISK=3D0, SHUFFLE_BYTES_TO_MEM=3D31, SHUFFLE_PHASE_TIME=3D50525][TaskCounte= r_Map_11_INPUT_supplier INPUT_RECORDS_PROCESSED=3D396, INPUT_SPLIT_LENGTH_B= YTES=3D21563271][TaskCounter_Map_11_OUTPUT_Reducer_12 ADDITIONAL_SPILLS_BYT= ES_READ=3D0, ADDITIONAL_SPILLS_BYTES_WRITTEN=3D0, ADDITIONAL_SPILL_COUNT=3D= 0, OUTPUT_BYTES=3D779474, OUTPUT_BYTES_PHYSICAL=3D164787, OUTPUT_BYTES_WITH= _OVERHEAD=3D779485, OUTPUT_LARGE_RECORDS=3D0, OUTPUT_RECORDS=3D1, SPILLED_R= ECORDS=3D0][TaskCounter_Map_11_OUTPUT_Reducer_6 ADDITIONAL_SPILLS_BYTES_REA= D=3D0, ADDITIONAL_SPILLS_BYTES_WRITTEN=3D0, ADDITIONAL_SPILL_COUNT=3D0, OUT= PUT_BYTES=3D456344, OUTPUT_BYTES_PHYSICAL=3D192733, OUTPUT_BYTES_WITH_OVERH= EAD=3D487822, OUTPUT_LARGE_RECORDS=3D0, OUTPUT_RECORDS=3D15736, SPILLED_REC= ORDS=3D0][org.apache.hadoop.hive.llap.counters.LlapIOCounters ALLOCATED_BYT= ES=3D9633792, ALLOCATED_USED_BYTES=3D7976706, CACHE_HIT_BYTES=3D0, CACHE_MI= SS_BYTES=3D2344364, CONSUMER_TIME_NS=3D1136392475, DECODE_TIME_NS=3D1403779= 15, HDFS_TIME_NS=3D145825282, METADATA_CACHE_MISS=3D4, NUM_DECODED_BATCHES= =3D41, NUM_VECTOR_BATCHES=3D396, ROWS_EMITTED=3D395611, SELECTED_ROWGROUPS= =3D41, TOTAL_IO_TIME_NS=3D1234990631]] > {code} > I think this is happening because the fetcher is in [this loop|https://gi= thub.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runti= me-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl= /ShuffleManager.java#L366] which is started [here|https://github.com/apache= /tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/sr= c/main/java/org/apache/tez/runtime/task/TaskRunner2Callable.java#L69] and n= ot stopped until the [processor is closed|https://github.com/apache/tez/blo= b/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/j= ava/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java#L387] (which = is called from [here|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5d= fab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/ru= ntime/task/TaskRunner2Callable.java#L83] or [here|https://github.com/apache= /tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/sr= c/main/java/org/apache/tez/runtime/task/TaskRunner2Callable.java#L110]). > In {{ShuffleManager}} we seem to keep track of {{TaskCounter.NUM_FAILED_S= HUFFLE_INPUTS}} but nothing is done when the value gets too high. Maybe som= ething similar to {{ShuffleScheduler}} should be done where the fetchers ar= e only retried a certain amount of times. -- This message was sent by Atlassian JIRA (v7.6.3#76005)