From: Brett Hoerner
Date: Thu, 25 Sep 2014 12:36:03 -0500
Subject: Solr mapred MTree merge stage ~6x slower in 4.10
To: solr-user@lucene.apache.org

As an update to this thread, it seems my MTree merge wasn't completely
hanging; it was just much slower in 4.10.

If I replace 4.9.0 with 4.10 in my jar, the MTree merge stage is 6x (or
more) slower (in my case, 20 min becomes 2 hours). I hope to bisect this in
the future, but the jobs I'm running take a long time. I haven't tried to
see if the issue shows on smaller jobs yet (does 1 minute become 6 minutes?).

Brett

On Tue, Sep 16, 2014 at 12:54 PM, Brett Hoerner wrote:

> I have a very weird problem that I'm going to try to describe here to see
> if anyone has any "ah-ha" moments or clues. I haven't created a small
> reproducible project for this but I guess I will have to try in the future
> if I can't figure it out. (Or I'll need to bisect by running long Hadoop
> jobs...)
>
> So, the facts:
>
> * Have been successfully using Solr mapred to build very large Solr
>   clusters for months
> * As of Solr 4.10, *some* job sizes repeatably hang in the MTree merge
>   phase
> * Those same jobs (same input, output, and Hadoop cluster itself) succeed
>   if I only change my Solr deps to 4.9
> * The job *does succeed* in 4.10 if I use the same data to create more,
>   but smaller shards (e.g.
>   12x as many shards, each 1/12th the size of the job
>   that fails)
> * Creating my "normal size" shards (the size I want, that works in 4.9),
>   the job hangs with 2 mappers running, 0 reducers in the MTree merge phase
> * There are no errors or warnings in the syslog/stderr of the MTree
>   mappers, and no errors are ever echoed back to the "interactive run" of
>   the job (mapper says 100%, reduce says 0%, and it stays that way forever)
> * No CPU being used on the boxes running the merge, no GC happening, JVM
>   waiting on a futex, all threads blocked on various queues
> * No disk usage problems, nothing else obviously wrong with any box in the
>   cluster
>
> I diff'ed around between 4.10 and 4.9 and barely see any changes in mapred
> contrib, mostly some test stuff. I didn't see any transitive dependency
> changes in Solr/Lucene that look like they would affect me.
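
The bisection mentioned in the update could be sketched as a binary search over an ordered list of revisions between 4.9 and 4.10. This is only an illustration, not the procedure actually used: the names `first_bad` and `is_bad` and the revision labels are hypothetical, and the timings are placeholders mirroring the "20 min becomes 2 hours" figure, not measurements. It assumes the oldest revision is known good, the newest is known bad, and every revision after the first slow one is also slow.

```python
from typing import Callable, Sequence


def first_bad(revisions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest revision for which is_bad() is true.

    Assumes revisions are ordered oldest to newest, revisions[0] is known
    good, revisions[-1] is known bad, and badness never reverts in between.
    """
    lo, hi = 0, len(revisions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return revisions[hi]


if __name__ == "__main__":
    # Hypothetical revision labels with placeholder merge times in minutes.
    merge_minutes = {"rev-a": 20, "rev-b": 20, "rev-c": 120, "rev-d": 120}
    revisions = list(merge_minutes)
    # Treat anything over an hour as "slow"; the threshold is an assumption.
    print(first_bad(revisions, lambda rev: merge_minutes[rev] > 60))
```

Each call to `is_bad` stands for one full job run, so the number of runs grows with the logarithm of the number of revisions; that matters when a single run takes up to 2 hours.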