Date: Thu, 11 Jan 2007 11:57:30 -0800
From: "Bryan A. P. Pendleton"
To: hadoop-dev@lucene.apache.org
Subject: What's wrong with speculative execution, again?

I know the default was changed to "off" because of some bug. What's the
nature of the problem?

I ran a job last night that hung for a long time because a task somehow got
assigned to a tasktracker that wasn't taking tasks - the task stayed as
"UNASSIGNED" in status indefinitely. I eventually killed the tasktracker,
which let the whole job finish. Had speculative execution been on, there
would have been no problem here.

I'm not sure whether this is a new bug or somehow related to the core
speculative execution bug, but it would be nice to have speculative
execution turned back on, as it really does cut the turnaround time on jobs.
I'm now regularly running jobs that occupy ~100 CPUs for half a day or so,
and the lack of speculative execution, plus the occasional wacky machine,
increases the turnaround on these jobs by large fractions of the total job
time. I'd love to see this problem go (back) away.

-- 
Bryan A. P. Pendleton
Ph: (877) geek-1-bp