Return-Path: X-Original-To: apmail-tez-issues-archive@minotaur.apache.org Delivered-To: apmail-tez-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 4A2AE1880C for ; Wed, 24 Feb 2016 00:07:18 +0000 (UTC) Received: (qmail 10713 invoked by uid 500); 24 Feb 2016 00:07:18 -0000 Delivered-To: apmail-tez-issues-archive@tez.apache.org Received: (qmail 10656 invoked by uid 500); 24 Feb 2016 00:07:18 -0000 Mailing-List: contact issues-help@tez.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@tez.apache.org Delivered-To: mailing list issues@tez.apache.org Received: (qmail 10642 invoked by uid 99); 24 Feb 2016 00:07:18 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 24 Feb 2016 00:07:18 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 16BBD2C1F58 for ; Wed, 24 Feb 2016 00:07:18 +0000 (UTC) Date: Wed, 24 Feb 2016 00:07:18 +0000 (UTC) From: "Jeff Zhang (JIRA)" To: issues@tez.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (TEZ-3124) Running task hangs due to missing event to initialize input in recovery MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/TEZ-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159919#comment-15159919 ] Jeff Zhang commented on TEZ-3124: --------------------------------- RecoveryParser will scan the recovery files from the oldest attempt to latest attempt. And will pick up the latest VertexInitializedEvent if there's multiple VertexInitializedEvent. The root cause is that we will still log VertexInitializedEvent if shouldSkipInit() is true and in this cause there's zero initGeneratedEvent in the VertexInitializedEvent which cause the next recovery hangs. I can reproduce it in the test case and verify it is resolved after this patch. > Running task hangs due to missing event to initialize input in recovery > ----------------------------------------------------------------------- > > Key: TEZ-3124 > URL: https://issues.apache.org/jira/browse/TEZ-3124 > Project: Apache Tez > Issue Type: Bug > Affects Versions: 0.8.2 > Reporter: Jeff Zhang > Assignee: Jeff Zhang > Labels: Recovery > Fix For: 0.8.3 > > Attachments: TEZ-3124-1.patch, TEZ-3124-2.patch, TEZ-3124-3.patch, TEZ-3124-4.patch, a.log > > > {noformat} > 2016-02-09 04:48:42 Starting to run new task attempt: attempt_1454993155302_0001_1_00_000061_3 > /attempt_1454993155302_0001_1_00_000061 > 2016-02-09 04:48:43,196 [INFO] [I/O Setup 0 Initialize: {MRInput}] |input.MRInput|: MRInput using newmapreduce API=true, split via event=true, numPhysicalInputs=1 > 2016-02-09 04:48:43,200 [INFO] [I/O Setup 0 Initialize: {MRInput}] |input.MRInputLegacy|: MRInput MRInputLegacy deferring initialization > 2016-02-09 04:48:43,333 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Initialized processor > 2016-02-09 04:48:43,333 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Waiting for 2 initializers to finish > 2016-02-09 04:48:43,333 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Waiting for 1 initializers to finish > 2016-02-09 04:48:43,333 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: All initializers finished > 2016-02-09 04:48:43,345 [INFO] [TezChild] |resources.MemoryDistributor|: InitialRequests=[MRInput:INPUT:0:org.apache.tez.mapreduce.input.MRInputLegacy], [ireduce1:OUTPUT:1802502144:org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput] > 2016-02-09 04:48:43,559 [INFO] [TezChild] |resources.WeightedScalingMemoryDistributor|: ScaleRatiosUsed=[PARTITIONED_UNSORTED_OUTPUT:1][UNSORTED_OUTPUT:1][UNSORTED_INPUT:1][SORTED_OUTPUT:12][SORTED_MERGED_INPUT:12][PROCESSOR:1][OTHER:1] > 2016-02-09 04:48:43,563 [INFO] [TezChild] |resources.WeightedScalingMemoryDistributor|: InitialReservationFraction=0.3, AdditionalReservationFractionForIOs=0.03, finalReserveFractionUsed=0.32999999999999996 > 2016-02-09 04:48:43,564 [INFO] [TezChild] |resources.WeightedScalingMemoryDistributor|: Scaling Requests. NumRequests: 2, numScaledRequests: 13, TotalRequested: 1802502144, TotalRequestedScaled: 1.663848132923077E9, TotalJVMHeap: 2577399808, TotalAvailable: 1726857871, TotalRequested/TotalJVMHeap:0.70 > 2016-02-09 04:48:43,564 [INFO] [TezChild] |resources.MemoryDistributor|: Allocations=[MRInput:org.apache.tez.mapreduce.input.MRInputLegacy:INPUT:0:0], [ireduce1:org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput:OUTPUT:1802502144:1726857871] > 2016-02-09 04:48:43,564 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Starting Inputs/Outputs > 2016-02-09 04:48:43,572 [INFO] [I/O Setup 1 Start: {MRInput}] |runtime.LogicalIOProcessorRuntimeTask|: Started Input with src edge: MRInput > 2016-02-09 04:48:43,572 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Input: MRInput being auto started by the framework. Subsequent instances will not be auto-started > 2016-02-09 04:48:43,573 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Num IOs determined for AutoStart: 1 > 2016-02-09 04:48:43,574 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Waiting for 1 IOs to start > 2016-02-09 04:48:43,574 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: AutoStartComplete > 2016-02-09 04:48:43,583 [INFO] [TezChild] |task.TaskRunner2Callable|: Running task, taskAttemptId=attempt_1454993155302_0001_1_00_000061_3 > 2016-02-09 04:48:43,583 [INFO] [TezChild] |map.MapProcessor|: Running map: attempt_1454993155302_0001_1_00_000061_3_10001 > 2016-02-09 04:48:43,675 [INFO] [TezChild] |impl.ExternalSorter|: ireduce1 using: memoryMb=1646, keySerializerClass=class org.apache.hadoop.io.IntWritable, valueSerializerClass=org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer@5f143de6, comparator=org.apache.hadoop.io.IntWritable$Comparator@ec52d1f, partitioner=org.apache.tez.mapreduce.partition.MRPartitioner, serialization=org.apache.hadoop.io.serializer.WritableSerialization > 2016-02-09 04:48:43,686 [INFO] [TezChild] |impl.PipelinedSorter|: Setting up PipelinedSorter for ireduce1: , UsingHashComparator=false > 2016-02-09 04:48:45,093 [INFO] [TezChild] |impl.PipelinedSorter|: Newly allocated block size=1725956096, index=0, Number of buffers=1, currentAllocatableMemory=0, currentBufferSize=1725956096, total=1725956096 > 2016-02-09 04:48:45,093 [INFO] [TezChild] |impl.PipelinedSorter|: Pre allocating rest of memory buffers upfront > 2016-02-09 04:48:45,093 [INFO] [TezChild] |impl.PipelinedSorter|: Setting up PipelinedSorter for ireduce1: , UsingHashComparator=false#blocks=1, maxMemUsage=1725956096, lazyAllocateMem=false, minBlockSize=2097152000, initial BLOCK_SIZE=1725956096, finalMergeEnabled=true, pipelinedShuffle=false, sendEmptyPartitions=true, tez.runtime.io.sort.mb=1646 > 2016-02-09 04:48:45,099 [INFO] [TezChild] |impl.PipelinedSorter|: ireduce1: reserved.remaining()=1725956096, reserved.metasize=16777216 > 2016-02-09 04:48:45,175 [INFO] [TezChild] |input.MRInput|: Initialized MRInput: MRInput > 2016-02-09 08:55:40,790 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Received should die response from AM > 2016-02-09 08:55:40,791 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Asked to die via task heartbeat > 2016-02-09 08:55:40,796 [INFO] [TaskHeartbeatThread] |task.TezTaskRunner2|: Attempting to abort attempt_1454993155302_0001_1_00_000061_3 due to an invocation of shutdownRequested > 2016-02-09 08:55:40,804 [INFO] [TezChild] |task.TaskRunner2Callable|: Cleaning up task attempt_1454993155302_0001_1_00_000061_3, stopRequested=true > 2016-02-09 08:55:40,805 [INFO] [TezChild] |runtime.LogicalIOProcessorRuntimeTask|: Final Counters for attempt_1454993155302_0001_1_00_000061_3: Counters: 20 [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0, INPUT_RECORDS_PROCESSED=0, OUTPUT_RECORDS=0, OUTPUT_BYTES=0, OUTPUT_BYTES_WITH_OVERHEAD=0, OUTPUT_BYTES_PHYSICAL=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, SHUFFLE_CHUNK_COUNT=0][TaskCounter_map_INPUT_MRInput INPUT_RECORDS_PROCESSED=0][TaskCounter_map_OUTPUT_ireduce1 ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=0, OUTPUT_BYTES_PHYSICAL=0, OUTPUT_BYTES_WITH_OVERHEAD=0, OUTPUT_RECORDS=0, SHUFFLE_CHUNK_COUNT=0, SPILLED_RECORDS=0]] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)