From: Andrew Johnson <ajohnson@etsy.com>
To: user@hadoop.apache.org
Date: Fri, 23 Jan 2015 16:40:56 -0500
Subject: Re: Mapreduce HistoryServer Hangs Processing Large Jobs

Hi Chris,

I've created the MAPREDUCE jira:
https://issues.apache.org/jira/browse/MAPREDUCE-6222

Would it be useful to have the actual history file? I can truncate it
some if the full 500 MB is too much.

On Fri, Jan 23, 2015 at 4:30 PM, Chris Nauroth <cnauroth@hortonworks.com> wrote:
> Hi Andrew,
>
> I haven't seen this myself, but I can tell from the jstack output that
> numerous threads are blocked waiting to run synchronized methods.
> The most interesting thread is the one currently holding the lock
> (stack trace truncated a bit):
>
> "1592215278@qtp-1568218435-195" daemon prio=10 tid=0x000000000135d000 nid=0xb076 runnable [0x00007f2d7b4f1000]
>    java.lang.Thread.State: RUNNABLE
>         at java.lang.StringCoding.encode(StringCoding.java:364)
>         at java.lang.String.getBytes(String.java:939)
>         at org.apache.avro.util.Utf8$2.toUtf8(Utf8.java:123)
>         at org.apache.avro.util.Utf8.getBytesFor(Utf8.java:172)
>         at org.apache.avro.util.Utf8.<init>(Utf8.java:39)
>         at org.apache.avro.io.JsonDecoder.readString(JsonDecoder.java:214)
>         at org.apache.avro.io.ResolvingDecoder.readString(ResolvingDecoder.java:201)
>         at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:363)
>         at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:355)
>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:157)
>         at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
>         at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
>         at org.apache.avro.generic.GenericDatumReader.readArray(GenericDatumReader.java:219)
>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>         at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
>         at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
>         at org.apache.avro.generic.GenericDatumReader.readArray(GenericDatumReader.java:219)
>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>         at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
>         at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
>         at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
>         at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
>         at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
>         at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
>         at org.apache.hadoop.mapreduce.jobhistory.EventReader.getNextEvent(EventReader.java:89)
>         at org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser.parse(JobHistoryParser.java:111)
>         - locked <0x00007f344a0cf0d8> (a org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser)
>         at org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser.parse(JobHistoryParser.java:153)
>         - locked <0x00007f344a0cf0d8> (a org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser)
>         at org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser.parse(JobHistoryParser.java:139)
>         - locked <0x00007f344a0cf0d8> (a org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser)
>         at org.apache.hadoop.mapreduce.v2.hs.CompletedJob.loadFullHistoryData(CompletedJob.java:338)
>         - locked <0x00007f344a0c2388> (a org.apache.hadoop.mapreduce.v2.hs.CompletedJob)
>         at org.apache.hadoop.mapreduce.v2.hs.CompletedJob.<init>(CompletedJob.java:101)
>         at org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$HistoryFileInfo.loadJob(HistoryFileManager.java:413)
>         - locked <0x00007f2df576ffb0> (a org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$HistoryFileInfo)
>         at org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.loadJob(CachedHistoryStorage.java:106)
>
> This seems to indicate the thread is stuck deserializing Avro data from
> the history file. Perhaps it's something particular to the data in your
> history files. I do see an open Avro issue reporting an infinite loop
> condition during deserialization.
>
> https://issues.apache.org/jira/browse/AVRO-1422
>
> I don't know Avro well enough to be certain that this is the root cause,
> though.
>
> Do you want to submit a MAPREDUCE jira with this information? Even if the
> root cause is in Avro, we'd want to track upgrading our Avro dependency
> once a fix becomes available. Thanks!
>
> Chris Nauroth
> Hortonworks
> http://hortonworks.com/
>
> On Fri, Jan 23, 2015 at 12:42 PM, Andrew Johnson <ajohnson@etsy.com> wrote:
>
>> Hey everyone,
>>
>> I'm encountering an issue with the Mapreduce HistoryServer processing
>> the history files for large jobs. This has come up several times for
>> jobs with around 60000 total tasks. When the HistoryServer loaded the
>> .jhist file from HDFS for a job of that size (which is usually around
>> 500 MB), the HistoryServer's CPU usage spiked and the UI became
>> unresponsive. After about 10 minutes I restarted the HistoryServer and
>> it was behaving normally again.
>>
>> The cluster is running CDH 5.3 (2.5.0-cdh5.3.0). I've attached the
>> output of jstack from a time this was occurring. I do have an example
>> .jhist file that caused the problem, but have not attached it due to
>> its size.
>>
>> Has anyone else seen this happen before?
>>
>> Thanks for your help!
>>
>> --
>> Andrew Johnson
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law.
> If the reader of this message is not the intended recipient, you are
> hereby notified that any printing, copying, dissemination, distribution,
> disclosure or forwarding of this communication is strictly prohibited.
> If you have received this communication in error, please contact the
> sender immediately and delete it from your system. Thank You.

--
Andrew Johnson
Software Engineer, Etsy
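The failure mode Chris describes in the jstack output above — one thread parsing inside a long synchronized call while the qtp request threads queue up as BLOCKED — can be reproduced with a minimal, plain-JDK sketch. This is not Hadoop code; the `FakeHistoryStorage` class and its timings are hypothetical stand-ins for `CachedHistoryStorage.loadJob`:

```java
// Minimal sketch of a lock convoy: one thread holds an intrinsic lock
// inside a slow synchronized method, so every other caller shows up
// as BLOCKED in a thread dump, just like the HistoryServer UI threads.
public class FakeHistoryStorage {

    // Stands in for CachedHistoryStorage.loadJob(): a single monitor
    // guards the entire parse of the history file.
    public synchronized void loadJob(long parseMillis) {
        try {
            Thread.sleep(parseMillis); // simulate a slow 500 MB parse
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // exit early, release lock
        }
    }

    public static void main(String[] args) throws InterruptedException {
        FakeHistoryStorage storage = new FakeHistoryStorage();

        // The "slow parse" thread grabs the lock and keeps it.
        Thread parser = new Thread(() -> storage.loadJob(5000), "slow-parse");
        parser.start();
        // Wait until slow-parse is sleeping INSIDE the synchronized method.
        while (parser.getState() != Thread.State.TIMED_WAITING) {
            Thread.sleep(10);
        }

        // A web request thread now calls the same synchronized method.
        Thread uiRequest = new Thread(() -> storage.loadJob(0), "ui-request");
        uiRequest.start();
        // Wait until it is parked on the monitor, as jstack would show it.
        while (uiRequest.getState() != Thread.State.BLOCKED) {
            Thread.sleep(10);
        }
        System.out.println("ui-request state: " + uiRequest.getState());

        parser.interrupt(); // release the lock so the demo exits promptly
        uiRequest.join();
    }
}
```

This prints `ui-request state: BLOCKED`, matching the blocked threads in the attached dump: the UI does not hang because request threads are busy, but because they are all queued behind one monitor while a single thread does the Avro parse.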