Date: Tue, 7 Mar 2017 19:17:38 +0000 (UTC)
From: "Yufei Gu (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Updated] (MAPREDUCE-6858) HistoryFileManager thrashing due to high volume jobs
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yufei Gu updated MAPREDUCE-6858:
--------------------------------
    Description:
The JHS log shows that it tried to move the same *.jhist file twice, and the second move causes FileNotFoundExceptions.
- JHS scans the "done_intermediate" dir for files to process and adds them to a thread pool
- The thread pool starts processing these files to move them to the "done" dir
- JHS scans "done_intermediate" again for files to process and adds them to the thread pool
-- If there are enough jobs that the thread pool can't keep up with the scanning interval, the same files get added twice (or more). If this keeps compounding, jobs pile up, go unprocessed for quite some time, and produce lots of FileNotFoundExceptions.

By default, it looks like the thread pool only has 3 threads in it (mapreduce.jobhistory.move.thread-count), and the scan interval is 3 minutes (mapreduce.jobhistory.move.interval-ms). Perhaps we should increase these?

  was:
The log of JHS shows that it tried to move the same *.jhist file twice, and the second move causes FileNotFoundExceptions.
- JHS scans the "done_intermediate" dir for files to process and adds them to a thread pool
- The thread pool starts processing these files to move them to the "done" dir
- JHS scans "done_intermediate" again for files to process and adds them to the thread pool
-- If there are enough jobs that the thread pool can't keep up with the scanning interval, the same files get added twice (or more). If this keeps compounding, jobs pile up, go unprocessed for quite some time, and produce lots of FileNotFoundExceptions.

By default, it looks like the thread pool only has 3 threads in it (mapreduce.jobhistory.move.thread-count), and the scan interval is 3 minutes (mapreduce.jobhistory.move.interval-ms). Perhaps we should increase these?


> HistoryFileManager thrashing due to high volume jobs
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-6858
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6858
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>            Reporter: Yufei Gu
>
> The JHS log shows that it tried to move the same *.jhist file twice, and the second move causes FileNotFoundExceptions.
> - JHS scans the "done_intermediate" dir for files to process and adds them to a thread pool
> - The thread pool starts processing these files to move them to the "done" dir
> - JHS scans "done_intermediate" again for files to process and adds them to the thread pool
> -- If there are enough jobs that the thread pool can't keep up with the scanning interval, the same files get added twice (or more). If this keeps compounding, jobs pile up, go unprocessed for quite some time, and produce lots of FileNotFoundExceptions.
> By default, it looks like the thread pool only has 3 threads in it (mapreduce.jobhistory.move.thread-count), and the scan interval is 3 minutes (mapreduce.jobhistory.move.interval-ms). Perhaps we should increase these?
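
Besides tuning the two settings, the double submission described in the scan steps could in principle be avoided by remembering which files are already queued. The following is only a minimal sketch of that idea, not the fix adopted for MAPREDUCE-6858 and not code from HistoryFileManager; the class InFlightMoveTracker and its methods claimNew and release are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (assumption): tracks files already handed to the move pool. */
class InFlightMoveTracker {
    // Paths that were queued for moving and have not finished yet.
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /** Returns the scanned paths that are not queued yet and marks them as queued. */
    List<String> claimNew(List<String> scanned) {
        List<String> fresh = new ArrayList<>();
        for (String path : scanned) {
            // Set.add returns false if the path is already present,
            // so a file seen by two overlapping scans is claimed only once.
            if (inFlight.add(path)) {
                fresh.add(path);
            }
        }
        return fresh;
    }

    /** Call once a move has finished (or failed) so a later scan may pick the path up again. */
    void release(String path) {
        inFlight.remove(path);
    }

    public static void main(String[] args) {
        InFlightMoveTracker tracker = new InFlightMoveTracker();
        List<String> scan = Arrays.asList("job_1.jhist", "job_2.jhist");

        // First scan: both files are new and would be submitted to the pool.
        System.out.println(tracker.claimNew(scan));
        // Second scan while the moves are still pending: nothing is submitted again.
        System.out.println(tracker.claimNew(scan));
        // After a move completes, only that path can be claimed again.
        tracker.release("job_1.jhist");
        System.out.println(tracker.claimNew(scan));
    }
}
```

With a guard like this, a scan that overlaps with pending moves submits nothing twice, so the second move that triggers the FileNotFoundException would not be attempted. Raising mapreduce.jobhistory.move.thread-count or mapreduce.jobhistory.move.interval-ms, as suggested in the description, only reduces how often the overlap happens.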
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-help@hadoop.apache.org