Date: Tue, 7 Mar 2017 19:14:38 +0000 (UTC)
From: "Yufei Gu (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Updated] (MAPREDUCE-6858) HistoryFileManager thrashing due to high volume jobs

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yufei Gu updated MAPREDUCE-6858:
--------------------------------
    Description:
- JHS scans the "done_intermediate" dir for files to process and adds them to a thread pool
- Thread pool starts processing these files to move them to the "done" dir
- JHS scans "done_intermediate" again for files to process and adds them to the thread pool
-- If there are enough jobs that the thread pool can't keep up with the scanning interval, files get added twice (or more). If this keeps compounding, jobs pile up, go unprocessed for quite some time, and produce lots of FileNotFoundExceptions.

By default, it looks like the thread pool has only 3 threads (mapreduce.jobhistory.move.thread-count), and the scan interval is 3 minutes (mapreduce.jobhistory.move.interval-ms). Perhaps we should increase these?

  was:
- JHS scans "done_intermediate" dir for files to process and adds them to a thread pool
- Thread pool starts processing these files to move them to "done" dir
- JHS scans "done_intermediate" again for files to process and adds them to a thread pool
-- If we have enough jobs where the thread pool can't keep up with the scanning interval, they'll get added twice (or more). If this keeps compounding, I wouldn't be surprised if jobs end up piling up and not getting processed for quite some time and getting lots of FileNotFoundException's.

By default, it looks like the thread pool only has 3 threads in it (mapreduce.jobhistory.move.thread-count). And the scan interval is 3 minutes (mapreduce.jobhistory.move.interval-ms). Perhaps we should increase these?

> HistoryFileManager thrashing due to high volume jobs
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-6858
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6858
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>            Reporter: Yufei Gu
>
> - JHS scans the "done_intermediate" dir for files to process and adds them to a thread pool
> - Thread pool starts processing these files to move them to the "done" dir
> - JHS scans "done_intermediate" again for files to process and adds them to the thread pool
> -- If there are enough jobs that the thread pool can't keep up with the scanning interval, files get added twice (or more). If this keeps compounding, jobs pile up, go unprocessed for quite some time, and produce lots of FileNotFoundExceptions.
> By default, it looks like the thread pool has only 3 threads (mapreduce.jobhistory.move.thread-count), and the scan interval is 3 minutes (mapreduce.jobhistory.move.interval-ms). Perhaps we should increase these?

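The scan/move cycle described in this issue can be illustrated with a toy model. The sketch below is not HistoryFileManager code: the class `ScanDedupSketch`, its `simulate` method, and the `skipQueued` guard are hypothetical names introduced only for illustration, and the model is single-threaded and deterministic rather than a real thread pool. It assumes that a duplicate task fails with something like a FileNotFoundException because an earlier task has already moved the file, which is one plausible reading of the report.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy, single-threaded model of the scan/move cycle; not Hadoop code. */
final class ScanDedupSketch {

    /** Counters collected over one simulated run. */
    static final class Result {
        int enqueued;  // tasks handed to the simulated pool
        int moved;     // files moved out of "done_intermediate"
        int notFound;  // tasks whose file had already been moved
    }

    /**
     * @param files            files initially sitting in "done_intermediate"
     * @param scans            number of scan intervals to simulate
     * @param movesPerInterval tasks the pool can finish between two scans
     * @param skipQueued       hypothetical guard: do not re-enqueue a file
     *                         that is still waiting in the queue
     */
    static Result simulate(int files, int scans, int movesPerInterval, boolean skipQueued) {
        Set<String> intermediate = new LinkedHashSet<>();
        for (int i = 0; i < files; i++) {
            intermediate.add("job_" + i);
        }
        Deque<String> queue = new ArrayDeque<>();
        Set<String> queued = new HashSet<>();
        Result result = new Result();

        for (int scan = 0; scan < scans; scan++) {
            // Scan: every file still present is submitted to the pool.
            for (String file : intermediate) {
                if (skipQueued && !queued.add(file)) {
                    continue; // already waiting, skip the duplicate
                }
                queue.add(file);
                result.enqueued++;
            }
            // Pool: only a bounded number of tasks finish before the next scan.
            for (int i = 0; i < movesPerInterval && !queue.isEmpty(); i++) {
                String file = queue.poll();
                queued.remove(file);
                if (intermediate.remove(file)) {
                    result.moved++;
                } else {
                    result.notFound++; // stands in for a FileNotFoundException
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result plain = simulate(10, 4, 3, false);
        Result guarded = simulate(10, 4, 3, true);
        System.out.println("unguarded: enqueued=" + plain.enqueued
                + " moved=" + plain.moved + " notFound=" + plain.notFound);
        System.out.println("guarded:   enqueued=" + guarded.enqueued
                + " moved=" + guarded.moved + " notFound=" + guarded.notFound);
    }
}
```

With 10 files, 4 scans, and 3 moves per interval, the unguarded run enqueues 22 tasks and hits 2 missing files, while the guarded run enqueues exactly 10 tasks and hits none. In this model, raising the thread count or the scan interval only shrinks the backlog; tracking which files are already queued removes the duplicates altogether. Whether that fits the actual JHS code would need to be checked against the source.
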