From: alex bohr <alexjbohr@gmail.com>
To: user@hadoop.apache.org
Date: Fri, 25 Oct 2013 12:40:22 -0700
Subject: Re: DFSClient: Could not complete write history logs

I should add that we recently changed some mapred-site properties on the JobTracker to tone down how much history the JobTracker stores in memory. Is it possible these settings are too aggressive, and the JobTracker is removing old jobs from memory while it's trying to write the status of a running job?

Here are the properties we recently changed:

    <property>
      <name>mapred.job.tracker.retiredjobs.cache.size</name>
      <value>100</value>
    </property>
    <property>
      <name>mapreduce.job.user.name</name>
      <value>hdfs</value>
    </property>
    <property>
      <name>mapred.jobtracker.completeuserjobs.maximum</name>
      <value>25</value>
    </property>
    <property>
      <name>mapred.jobtracker.retirejob.interval</name>
      <value>86400000</value>
    </property>
    <property>
      <name>mapred.jobtracker.retirejob.check</name>
      <value>3600000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>10.4.41.207:9001</value>
    </property>

I've used "mapred.job.tracker.retiredjobs.cache.size" before, and I'm fairly certain it was originally responsible for preventing weekly crashes of the JobTracker, but the other settings we introduced for the first time.

Thanks

On Fri, Oct 25, 2013 at 11:17 AM, alex bohr <alexjbohr@gmail.com> wrote:
> Hi,
> I've suddenly been having the JobTracker freeze up every couple of hours
> when it goes into a loop trying to write job history files.
>
> I get the error in various jobs, but it's always when writing the
> "_logs/history" files.
>
> I'm running MRv1: Hadoop 2.0.0-cdh4.4.0
>
> Here's a sample error:
> "2013-10-25 01:59:54,445 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> complete
> /user/etl/pipeline/stage02/b0c6fc02-1729-4a57-8799-553f4dd789a4/_logs/history/job_201310242314_0013_1382663618303_gxetl_GX-ETL.Bucketer
> retrying.."
>
> I have to stop and restart the JobTracker, and then it happens again, and
> the intervals between errors have been getting shorter.
>
> I saw this ticket:
> https://issues.apache.org/jira/browse/HDFS-1059
> But I ran fsck, and the report says 0 corrupt and 0 under-replicated blocks.
>
> I also found this thread:
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201110.mbox/%3CCAF8-MNf7P_Kr8SNHBng1cDJ70vGET58_V+JNMA21OWymrc1aVA@mail.gmail.com%3E
>
> I'm not familiar with the different IO schedulers, so before I change this
> on all our datanodes - *does anyone recommend using deadline instead of
> CFQ?*
> We are using the ext4 file system on our datanodes, which have 24 drives
> (we checked for bad drives, found one that wasn't responding, and pulled
> it from the config for that machine, but the errors keep happening).
>
> Or any other advice on addressing this infinite loop, beyond the IO
> scheduler, is much appreciated.
> Thanks,
> Alex
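The fsck check mentioned above can also be scoped to the affected output tree instead of the whole namespace, which is much faster on a large cluster. A minimal sketch: the path is the pipeline directory from the error message, and the flags are standard `hadoop fsck` options.

```shell
# Run fsck only under the pipeline's staging directory; -files -blocks
# lists per-file block status, and the trailing summary reports the
# corrupt and under-replicated counters. Guarded so the sketch is a
# no-op on hosts without the Hadoop CLI installed.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fsck /user/etl/pipeline -files -blocks | tail -n 20
fi

# The fsck summary contains lines like the following; grep isolates the
# two counters worth watching (sample text, not live cluster output):
printf 'Status: HEALTHY\n Corrupt blocks:\t0\n Under-replicated blocks:\t0\n' \
  | grep -E 'Corrupt blocks|Under-replicated blocks'
```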
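On the deadline-vs-CFQ question: the active scheduler can be inspected and switched per block device at runtime, so it is easy to trial deadline on a single datanode before rolling it out everywhere. A sketch, assuming `/dev/sd*` device naming (adjust the glob for your hosts); writing into sysfs takes effect immediately but does not persist across reboots.

```shell
# Print each disk's scheduler line; the active scheduler is shown in
# [brackets], e.g. "noop deadline [cfq]".
for f in /sys/block/sd?/queue/scheduler; do
  [ -e "$f" ] || continue          # skip if the glob matched nothing
  printf '%s: %s\n' "$f" "$(cat "$f")"
done

# Extract the bracketed (active) entry from a scheduler line:
echo 'noop deadline [cfq]' | sed 's/.*\[\(.*\)\].*/\1/'    # prints: cfq

# To switch one device to deadline (run as root; not persistent -- use an
# elevator=deadline kernel parameter or a udev rule to make it stick):
# echo deadline > /sys/block/sda/queue/scheduler
```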