Return-Path: X-Original-To: apmail-incubator-mesos-dev-archive@minotaur.apache.org Delivered-To: apmail-incubator-mesos-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 7D33610AA4 for ; Mon, 1 Jul 2013 20:18:55 +0000 (UTC) Received: (qmail 40177 invoked by uid 500); 1 Jul 2013 20:18:55 -0000 Delivered-To: apmail-incubator-mesos-dev-archive@incubator.apache.org Received: (qmail 40142 invoked by uid 500); 1 Jul 2013 20:18:55 -0000 Mailing-List: contact mesos-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: mesos-dev@incubator.apache.org Delivered-To: mailing list mesos-dev@incubator.apache.org Received: (qmail 40134 invoked by uid 99); 1 Jul 2013 20:18:55 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 01 Jul 2013 20:18:55 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,NORMAL_HTTP_TO_IP,RCVD_IN_DNSWL_LOW,WEIRD_PORT X-Spam-Check-By: apache.org Received-SPF: error (athena.apache.org: local policy) Received: from [209.85.214.171] (HELO mail-ob0-f171.google.com) (209.85.214.171) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 01 Jul 2013 20:18:50 +0000 Received: by mail-ob0-f171.google.com with SMTP id dn14so4824282obc.30 for ; Mon, 01 Jul 2013 13:18:09 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:x-gm-message-state; bh=ctojOmjOenw9T+VeXa+KpuFIMCReBFgBVOhhtoD3w+c=; b=K2FTyPng1RLQMj/D8wQcvIfrHwi4L/eXfrH7rUD7A/J873BPtTf3y8/sIxq8bCQMv4 WEnF70owyMjosUGEDiR1d5CsO7bB/TPZh0+Sw+X883emg2a4S55EYkrOeh5jRUxiLtgi 5ZmlxiEs/XCGwQJeTx/BSGK4UvI8g2Z/t6nzSZQTnm+/Ys8Xd9ANp48T0dXrNzlbQDO8 dsFc1qPpSlmKLp3eSA+EHJ1UAYr6baJra63rbx9J0X3+3TkaG6udGfSub08ayd4HvjZI MHA2jz2y9WSli8WnJmdqTL64v6gyLfgTJCh9Wg9hC+uoFOVSk2lxcq29nArxiZCvNDuB SPgg== MIME-Version: 1.0 X-Received: by 10.182.111.199 with SMTP id ik7mr2836086obb.44.1372709889587; Mon, 01 Jul 2013 13:18:09 -0700 (PDT) Received: by 10.182.46.225 with HTTP; Mon, 1 Jul 2013 13:18:09 -0700 (PDT) In-Reply-To: References: Date: Mon, 1 Jul 2013 13:18:09 -0700 Message-ID: Subject: Re: Slave crash From: Brenden Matthews To: mesos Content-Type: multipart/alternative; boundary=089e0149cd4a59454c04e078ede9 X-Gm-Message-State: ALoCoQmkyvWFgEzcuUjCorcBHpSNNX10lWmV6cG4Wtn9BC5utsMA0ffKwjALTjnSaY6+/9ubwsU0 X-Virus-Checked: Checked by ClamAV on apache.org --089e0149cd4a59454c04e078ede9 Content-Type: text/plain; charset=ISO-8859-1 It looks like stime is wacky: I0701 20:16:45.686691 16005 process_isolator.cpp:99] Launching > executor_Task_Tracker_39837 (cd hadoop-* && ./bin/mesos-executor) in > /tmp/mesos/slaves/201307011931-3660134922-5050-27773-98/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_39837/runs/b04e3e21-27df-49c6-92ff-e2d8c445cde5 > with resources cpus=1; mem=5000' for framework > 201306291951-3660134922-5050-13580-0002 > I0701 20:16:45.687299 15993 slave.cpp:510] Successfully attached file > '/tmp/mesos/slaves/201307011931-3660134922-5050-27773-98/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_39837/runs/b04e3e21-27df-49c6-92ff-e2d8c445cde5' > I0701 20:16:45.689007 16005 process_isolator.cpp:161] Forked executor at > 16141 > Fetching resources into > '/tmp/mesos/slaves/201307011931-3660134922-5050-27773-98/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_39837/runs/b04e3e21-27df-49c6-92ff-e2d8c445cde5' > Fetching resource 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' > Downloading resource from > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' > HDFS command: hadoop fs -copyToLocal > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' > './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' > Extracting resource: tar xJf './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' > utime=0 stime=0 ticks=100 > utime=0 stime=1 ticks=100 > utime=0 stime=1 ticks=100 > utime=0 stime=1 ticks=100 > utime=0 stime=1 ticks=100 > utime=0 stime=1 ticks=100 > utime=0 stime=1 ticks=100 > utime=80 stime=537 ticks=100 > utime=9 stime=0 ticks=100 > utime=0 stime=2432 ticks=100 > utime=0 stime=4344 ticks=100 > utime=1589 stime=0 ticks=100 > utime=0 stime=1608 ticks=100 > utime=0 stime=0 ticks=100 > utime=2 stime=4286 ticks=100 > utime=1139 stime=0 ticks=100 > utime=0 stime=1571 ticks=100 > utime=0 stime=0 ticks=100 > utime=0 stime=1740 ticks=100 > utime=158 stime=958 ticks=100 > utime=0 stime=1403 ticks=100 > utime=35 stime=1527 ticks=100 > utime=1033 stime=0 ticks=100 > utime=0 stime=1514 ticks=100 > utime=8 stime=1454 ticks=100 > utime=1046 stime=0 ticks=100 > utime=0 stime=1389 ticks=100 > utime=1 stime=1309 ticks=100 > utime=984 stime=0 ticks=100 > utime=0 stime=1554 ticks=100 > utime=22 stime=1209 ticks=100 > utime=1015 stime=0 ticks=100 > utime=0 stime=1316 ticks=100 > utime=2 stime=1299 ticks=100 > utime=637 stime=328 ticks=100 > utime=0 stime=1736 ticks=100 > utime=53 stime=1091 ticks=100 > utime=1080 stime=0 ticks=100 > utime=0 stime=1503 ticks=100 > utime=49 stime=909 ticks=100 > utime=864 stime=0 ticks=100 > utime=0 stime=1447 ticks=100 > utime=2 stime=1123 ticks=100 > utime=887 stime=0 ticks=100 > utime=0 stime=1583 ticks=100 > utime=20 stime=1104 ticks=100 > utime=568 stime=319 ticks=100 > utime=0 stime=1582 ticks=100 > utime=20 stime=1041 ticks=100 > utime=928 stime=0 ticks=100 > utime=0 stime=1380 ticks=100 > utime=27 stime=1061 ticks=100 > utime=911 stime=0 ticks=100 > utime=0 stime=1500 ticks=100 > utime=57 stime=1031 ticks=100 > utime=996 stime=0 ticks=100 > utime=0 stime=1678 ticks=100 > utime=3 stime=1018 ticks=100 > utime=976 stime=0 ticks=100 > utime=0 stime=1164 ticks=100 > utime=0 stime=0 ticks=100 > utime=39 stime=617 ticks=100 > utime=938 stime=0 ticks=100 > utime=0 stime=1276 ticks=100 > utime=0 stime=0 ticks=100 > utime=42 stime=667 ticks=100 > utime=919 stime=0 ticks=100 > utime=0 stime=1308 ticks=100 > utime=0 stime=0 ticks=100 > utime=24 stime=747 ticks=100 > utime=1156 stime=0 ticks=100 > utime=0 stime=1863 ticks=100 > utime=0 stime=0 ticks=100 > utime=8 stime=616 ticks=100 > utime=894 stime=0 ticks=100 > utime=0 stime=1036 ticks=100 > utime=0 stime=0 ticks=100 > utime=6 stime=740 ticks=100 > utime=933 stime=0 ticks=100 > utime=0 stime=975 ticks=100 > utime=0 stime=0 ticks=100 > utime=140 stime=491 ticks=100 > utime=942 stime=0 ticks=100 > utime=0 stime=1117 ticks=100 > utime=0 stime=0 ticks=100 > utime=116 stime=578 ticks=100 > utime=869 stime=0 ticks=100 > utime=0 stime=1121 ticks=100 > utime=0 stime=0 ticks=100 > utime=52 stime=1844674407370 ticks=100 > Try::get() but state == ERROR: Argument larger than the maximum number of > seconds that a Duration can represent due to int64_t's size limit. > *** Aborted at 1372709810 (unix time) try "date -d @1372709810" if you are > using GNU date *** > PC: @ 0x7f80329a7425 (unknown) > *** SIGABRT (@0x3e4c) received by PID 15948 (TID 0x7f802a1e5700) from PID > 15948; stack trace: *** > @ 0x7f8032d3fcb0 (unknown) > @ 0x7f80329a7425 (unknown) > @ 0x7f80329aab8b (unknown) > @ 0x7f803384c63a os::process() > @ 0x7f803384e702 os::processes() > @ 0x7f80338507bc os::children() > @ 0x7f80338446b3 mesos::internal::slave::ProcessIsolator::usage() > @ 0x7f80337cf690 std::tr1::_Function_handler<>::_M_invoke() > @ 0x7f80337dd606 process::internal::pdispatcher<>() > @ 0x7f80337d08e8 std::tr1::_Function_handler<>::_M_invoke() > @ 0x7f80339c215c process::ProcessManager::resume() > @ 0x7f80339c2dbc process::schedule() > @ 0x7f8032d37e9a start_thread > @ 0x7f8032a64ccd (unknown) > I0701 20:16:54.926120 16220 main.cpp:119] Creating "process" isolator > I0701 20:16:54.926836 16220 main.cpp:127] Build: 2013-06-15 01:24:12 by > root > I0701 20:16:54.926872 16220 main.cpp:128] Starting Mesos slave On Mon, Jul 1, 2013 at 12:30 PM, Brenden Matthews < brenden.matthews@airbedandbreakfast.com> wrote: > Not yet! As soon as I have some samples I'll pass them along. > > > On Mon, Jul 1, 2013 at 12:25 PM, Benjamin Mahler < > benjamin.mahler@gmail.com> wrote: > >> Cool, do you have the values handy? >> >> >> On Mon, Jul 1, 2013 at 12:18 PM, Brenden Matthews < >> brenden.matthews@airbedandbreakfast.com> wrote: >> >> > I did a quick patch to print out the values. It seems to happen >> frequently >> > on some slaves, and not at all on others. >> > >> > >> > On Mon, Jul 1, 2013 at 12:07 PM, Benjamin Mahler >> > wrote: >> > >> > > Looks likely, would have been nice if we printed it out! ;) >> > > >> > > I'm curious, is this something rare? Or is it crashing consistently? >> The >> > > former would point to something odd with the values in >> /proc//stat, >> > > the latter would point to an issue with our code. >> > > >> > > I can send out a fix today, but without being able to reproduce it >> would >> > > simply be ignoring the strange utime / stime values. >> > > >> > > >> > > On Mon, Jul 1, 2013 at 11:12 AM, Brenden Matthews < >> > > brenden.matthews@airbedandbreakfast.com> wrote: >> > > >> > > > Hey guys, >> > > > >> > > > I'm getting another slave crash with the process usage stuff. >> Here's >> > the >> > > > log: >> > > > >> > > > I0701 16:44:51.263236 11682 slave.cpp:528] New master detected at >> > > > > master@10.58.41.218:5050 >> > > > > I0701 16:44:51.263598 11672 gc.cpp:56] Scheduling >> > > > > '/tmp/mesos/slaves/201306291951-3660134922-5050-13580-1538' for >> > removal >> > > > > I0701 16:44:51.264078 11676 status_update_manager.cpp:155] New >> master >> > > > > detected at master@10.58.41.218:5050 >> > > > > I0701 16:44:51.283917 11672 slave.cpp:588] Registered with master >> > > > > master@10.58.41.218:5050; given slave ID >> > > > > 201306291951-3660134922-5050-13580-2759 >> > > > > I0701 16:44:52.194198 11657 slave.cpp:1413] Got registration for >> > > executor >> > > > > 'executor_Task_Tracker_37434' of framework >> > > > > 201306291951-3660134922-5050-13580-0002 >> > > > > W0701 16:44:52.194744 11657 slave.cpp:1438] Shutting down >> executor >> > > > > 'executor_Task_Tracker_37434' as the framework >> > > > > 201306291951-3660134922-5050-13580-0002 does not >> > > > > exist >> > > > > I0701 16:44:56.630949 11666 slave.cpp:738] Got assigned task >> > > > > Task_Tracker_37439 for framework >> > > 201306291951-3660134922-5050-13580-0002 >> > > > > I0701 16:44:56.632294 11666 slave.cpp:836] Launching task >> > > > > Task_Tracker_37439 for framework >> > > 201306291951-3660134922-5050-13580-0002 >> > > > > I0701 16:44:56.634282 11666 paths.hpp:303] Created executor >> directory >> > > > > >> > > > >> > > >> > >> '/tmp/mesos/slaves/201306291951-3660134922-5050-13580-2759/frameworks/201306291951-3660134922- >> > > > > >> > > > > >> > > > >> > > >> > >> 5050-13580-0002/executors/executor_Task_Tracker_37439/runs/cf5d7062-b1cd-4da6-8b67-e7d3caa8bc9d' >> > > > > I0701 16:44:56.634918 11666 slave.cpp:947] Queuing task >> > > > > 'Task_Tracker_37439' for executor executor_Task_Tracker_37439 of >> > > > framework >> > > > > '201306291951-3660134922-5050-135 >> > > > > 80-0002 >> > > > > I0701 16:44:56.634908 11683 process_isolator.cpp:99] Launching >> > > > > executor_Task_Tracker_37439 (cd hadoop-* && ./bin/mesos-executor) >> in >> > > > > /tmp/mesos/slaves/201306291951-3 >> > > > > >> > > > >> > > >> > >> 660134922-5050-13580-2759/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_37439/runs/cf5d7062-b1cd-4da6-8b67-e7d3caa8bc9d >> > > > > with re >> > > > > sources cpus=1; mem=5000' for framework >> > > > > 201306291951-3660134922-5050-13580-0002 >> > > > > I0701 16:44:56.637537 11666 slave.cpp:510] Successfully attached >> file >> > > > > >> > > > >> > > >> > >> '/tmp/mesos/slaves/201306291951-3660134922-5050-13580-2759/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_37439/runs/cf5d7062-b1cd-4da6-8b67-e7d3caa8bc9d' >> > > > > I0701 16:44:56.637923 11683 process_isolator.cpp:161] Forked >> executor >> > > at >> > > > > 11809 >> > > > > Fetching resources into >> > > > > >> > > > >> > > >> > >> '/tmp/mesos/slaves/201306291951-3660134922-5050-13580-2759/frameworks/201306291951-3660134922-5050-13580-0002/executors/executor_Task_Tracker_37439/runs/cf5d7062-b1cd-4da6-8b67-e7d3caa8bc9d' >> > > > > Fetching resource >> > > > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' >> > > > > Downloading resource from >> > > > > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' >> > > > > HDFS command: hadoop fs -copyToLocal >> > > > > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' >> > > > > './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' >> > > > > Extracting resource: tar xJf >> > './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' >> > > > > Try::get() but state == ERROR: Argument larger than the maximum >> > number >> > > of >> > > > > seconds that a Duration can represent due to int64_t's size limit. >> > > > > *** Aborted at 1372697101 (unix time) try "date -d @1372697101" if >> > you >> > > > are >> > > > > using GNU date *** >> > > > > PC: @ 0x7f907ac82425 (unknown) >> > > > > *** SIGABRT (@0x2d69) received by PID 11625 (TID 0x7f906e4b8700) >> from >> > > PID >> > > > > 11625; stack trace: *** >> > > > > @ 0x7f907b01acb0 (unknown) >> > > > > @ 0x7f907ac82425 (unknown) >> > > > > @ 0x7f907ac85b8b (unknown) >> > > > > @ 0x7f907bb274ea os::process() >> > > > > @ 0x7f907bb296d2 os::processes() >> > > > > @ 0x7f907bb2b78c os::children() >> > > > > @ 0x7f907bb1f5d3 >> > > mesos::internal::slave::ProcessIsolator::usage() >> > > > > @ 0x7f907baaa5b0 >> std::tr1::_Function_handler<>::_M_invoke() >> > > > > @ 0x7f907bab8526 process::internal::pdispatcher<>() >> > > > > @ 0x7f907baab808 >> std::tr1::_Function_handler<>::_M_invoke() >> > > > > @ 0x7f907bc9d17c process::ProcessManager::resume() >> > > > > @ 0x7f907bc9dddc process::schedule() >> > > > > @ 0x7f907b012e9a start_thread >> > > > > @ 0x7f907ad3fccd (unknown) >> > > > > I0701 16:45:02.274014 11899 main.cpp:119] Creating "process" >> isolator >> > > > > I0701 16:45:02.274749 11899 main.cpp:127] Build: 2013-06-18 >> 01:38:35 >> > by >> > > > > I0701 16:45:02.274782 11899 main.cpp:128] Starting Mesos slave >> > > > >> > > > >> > > > >> > > > Here's the gdb backtrace: >> > > > >> > > > (gdb) bt >> > > > > #0 0x00007f8da4577425 in raise () from >> > /lib/x86_64-linux-gnu/libc.so.6 >> > > > > #1 0x00007f8da457ab8b in abort () from >> > /lib/x86_64-linux-gnu/libc.so.6 >> > > > > #2 0x00007f8da541c4ea in get (this=0x7f8d965a91e0) at >> > > > > ../../3rdparty/libprocess/3rdparty/stout/include/stout/try.hpp:66 >> > > > > #3 os::process (pid=) at >> > > > > >> > ../../3rdparty/libprocess/3rdparty/stout/include/stout/os/linux.hpp:57 >> > > > > #4 0x00007f8da541e6d2 in os::processes () at >> > > > > ../../3rdparty/libprocess/3rdparty/stout/include/stout/os.hpp:984 >> > > > > #5 0x00007f8da542078c in os::children (pid=12260, >> recursive=true) at >> > > > > ../../3rdparty/libprocess/3rdparty/stout/include/stout/os.hpp:997 >> > > > > #6 0x00007f8da54145d3 in >> > > mesos::internal::slave::ProcessIsolator::usage >> > > > > (this=, frameworkId=..., executorId=...) >> > > > > at ../../src/slave/process_isolator.cpp:396 >> > > > > #7 0x00007f8da539f5b0 in operator() (__args#1=..., __args#0=..., >> > > > > this=, __object=) at >> > > > > /usr/include/c++/4.6/tr1/functional:572 >> > > > > #8 __call >> (__args=..., >> > > > > this=) at /usr/include/c++/4.6/tr1/functional:1153 >> > > > > #9 operator() (this=> > > out>) >> > > > > at /usr/include/c++/4.6/tr1/functional:1207 >> > > > > #10 >> > > > >> std::tr1::_Function_handler >> > > > > (mesos::internal::slave::Isolator*), >> > > > > >> > > > >> > > >> > >> std::tr1::_Bind >> > > > > (mesos::internal::slave::Isolator::*)(mesos::FrameworkID const&, >> > > > > mesos::ExecutorID const&)> (std::tr1::_Placeholder<1>, >> > > > mesos::FrameworkID, >> > > > > mesos::ExecutorID)> >::_M_invoke(std::tr1::_Any_data const&, >> > > > > mesos::internal::slave::Isolator*) (__functor=..., >> > __args#0=> > > > out>) >> > > > > at /usr/include/c++/4.6/tr1/functional:1670 >> > > > > #11 0x00007f8da53ad526 in operator() (__args#0=, >> > > > > this=) at /usr/include/c++/4.6/tr1/functional:2040 >> > > > > #12 process::internal::pdispatcher> > > > > mesos::internal::slave::Isolator>(process::ProcessBase*, >> > > > > >> > > > >> > > >> > >> std::tr1::shared_ptr >> > > > > (mesos::internal::slave::Isolator*)> >, >> > > > > std::tr1::shared_ptr >> >) ( >> > > > > process=, thunk=..., promise=...) at >> > > > > ../../3rdparty/libprocess/include/process/dispatch.hpp:86 >> > > > > #13 0x00007f8da53a0808 in __call >> > > > > (__args=..., this=) at >> > > > > /usr/include/c++/4.6/tr1/functional:1153 >> > > > > #14 operator() (this=) at >> > > > > /usr/include/c++/4.6/tr1/functional:1207 >> > > > > #15 std::tr1::_Function_handler> > > > > std::tr1::_Bind, >> > > > > >> > > > >> > > >> > >> std::tr1::shared_ptr >> > > > > (mesos::internal::slave::Isolator*)> >, >> > > > > std::tr1::shared_ptr >> > > > > >))(process::ProcessBase*, >> > > > > >> > > > >> > > >> > >> std::tr1::shared_ptr >> > > > > (mesos::internal::slave::Isolator*)> >, >> > > > > std::tr1::shared_ptr >> >)> >> > > > > >::_M_invoke(std::tr1::_Any_data const&, process::ProcessBase*) >> > > > > (__functor=..., __args#0=) >> > > > > at /usr/include/c++/4.6/tr1/functional:1684 >> > > > > #16 0x00007f8da559217c in process::ProcessManager::resume >> > > (this=0xeedf20, >> > > > > process=0xf03df8) at >> > ../../../3rdparty/libprocess/src/process.cpp:2446 >> > > > > #17 0x00007f8da5592ddc in process::schedule (arg=) >> at >> > > > > ../../../3rdparty/libprocess/src/process.cpp:1175 >> > > > > #18 0x00007f8da4907e9a in start_thread () from >> > > > > /lib/x86_64-linux-gnu/libpthread.so.0 >> > > > > #19 0x00007f8da4634ccd in clone () from >> > /lib/x86_64-linux-gnu/libc.so.6 >> > > > > #20 0x0000000000000000 in ?? () >> > > > > (gdb) >> > > > >> > > > >> > > > >> > > > It looks like either `ticks` or `utime`/`stime` has an erroneous >> value. >> > > > Thoughts? >> > > > >> > > >> > >> > > --089e0149cd4a59454c04e078ede9--