Return-Path: X-Original-To: apmail-incubator-mesos-dev-archive@minotaur.apache.org Delivered-To: apmail-incubator-mesos-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id A125C10A1B for ; Thu, 13 Jun 2013 00:42:37 +0000 (UTC) Received: (qmail 65141 invoked by uid 500); 13 Jun 2013 00:42:37 -0000 Delivered-To: apmail-incubator-mesos-dev-archive@incubator.apache.org Received: (qmail 65060 invoked by uid 500); 13 Jun 2013 00:42:37 -0000 Mailing-List: contact mesos-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: mesos-dev@incubator.apache.org Delivered-To: mailing list mesos-dev@incubator.apache.org Received: (qmail 65052 invoked by uid 99); 13 Jun 2013 00:42:37 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 13 Jun 2013 00:42:37 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,NORMAL_HTTP_TO_IP,RCVD_IN_DNSWL_LOW,WEIRD_PORT X-Spam-Check-By: apache.org Received-SPF: error (nike.apache.org: local policy) Received: from [209.85.223.176] (HELO mail-ie0-f176.google.com) (209.85.223.176) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 13 Jun 2013 00:42:30 +0000 Received: by mail-ie0-f176.google.com with SMTP id ar20so22429532iec.21 for ; Wed, 12 Jun 2013 17:41:49 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:date:message-id:subject:from:to:content-type :x-gm-message-state; bh=N1ehpr0b0JAgBA5wdU+TPgZ6YL/DUVL31mEF1ZdHRBc=; b=VkCyVXG8tHDd4HgzGIcd5eDAXe/99TVRGGfJokNYPav+iKSc1OfNUtqbVhJY3k1wBX kfvn2oJ12/jpM02MZ2spIrALzWVdVv6WBm1ysUrz1887ybn4Ej34iZW3qvNGmxhm2iUQ L+Sbv6RMcxe6CQUEGVzq8IOkZFebLDdJ4AvFX5m6YmbVtGjaUUYlfX/AmQ+pGPq0oMX3 G60jKX+Cet8SfjWh2rVk9iMmgWWxvbfrIFC2xTY+dTS/v0KH/mfhjtjO0H8lIbuRtXco YDWSuzCh4LCe5ud8P47ONVHjlxVmDAdn/cNcXV5r7234+xFzI+JI9iZS/0iE0koWrwaA hWvg== MIME-Version: 1.0 X-Received: by 10.50.131.138 with SMTP id om10mr4567615igb.89.1371084109168; Wed, 12 Jun 2013 17:41:49 -0700 (PDT) Received: by 10.64.72.196 with HTTP; Wed, 12 Jun 2013 17:41:49 -0700 (PDT) Date: Wed, 12 Jun 2013 17:41:49 -0700 Message-ID: Subject: How to debug slave crashes From: Brenden Matthews To: mesos Content-Type: multipart/alternative; boundary=047d7b2e12db48d9d604defe6502 X-Gm-Message-State: ALoCoQnjUrUfDraPy0koBZNl/X3yT+Z1B6eefGx/H6dlULfDLYJMM06k5FeABpeSpz/WizjAUTc4 X-Virus-Checked: Checked by ClamAV on apache.org --047d7b2e12db48d9d604defe6502 Content-Type: text/plain; charset=ISO-8859-1 Hey guys, I was wondering how you typically debug slave crashes. I frequently have slaves crash, and I've had limited luck with collecting core dumps because of the frequency with which it occurs (it usually crashes repeatedly, and the cores get overwritten). I figured it would be quicker to just ask how you've been doing it in the past, rather than trying to reinvent the wheel. Here's a sample from the slave log that shows a crash: I0613 00:11:18.625746 1200 cgroups_isolator.cpp:864] Updated 'cpu.shares' > to 1024 for executor executor_Task_Tracker_129 of framework 20 > 1306122129-1707151626-5050-5724-0000 > I0613 00:11:18.629904 1200 cgroups_isolator.cpp:1002] Updated > 'memory.limit_in_bytes' to 2621440000 for executor executor_Task_Tracker_1 > 29 of framework 201306122129-1707151626-5050-5724-0000 > I0613 00:11:18.630421 1200 cgroups_isolator.cpp:1028] Started listening > for OOM events for executor executor_Task_Tracker_129 of framewo > rk 201306122129-1707151626-5050-5724-0000 > I0613 00:11:18.632015 1200 cgroups_isolator.cpp:560] Forked executor at = > 6789 > Fetching resources into > '/tmp/mesos/slaves/201306122021-1471680778-5050-4525-106/frameworks/201306122129-1707151626-5050-5724-0000/executors/executor_Task_Tracker_129/runs/f67609e4-cb85-45dc-bb6f-9668d292b81f' > Fetching resource 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' > Downloading resource from > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' > HDFS command: hadoop fs -copyToLocal > 'hdfs://airfs-h1/hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' > './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' > Extracting resource: tar xJf './hadoop-2.0.0-mr1-cdh4.2.1-mesos.tar.xz' > I0613 00:11:26.225594 1200 slave.cpp:1412] Got registration for executor > 'executor_Task_Tracker_129' of framework > 201306122129-1707151626-5050-5724-0000 > I0613 00:11:26.226132 1197 cgroups_isolator.cpp:667] Changing cgroup > controls for executor executor_Task_Tracker_129 of framework > 201306122129-1707151626-5050-5724-0000 with resources cpus=9.25; mem=21200; > disk=45056; ports=[31000-31000, 31001-31001] > I0613 00:11:26.226450 1200 slave.cpp:1527] Flushing queued task > Task_Tracker_129 for executor 'executor_Task_Tracker_129' of framework > 201306122129-1707151626-5050-5724-0000 > I0613 00:11:26.227196 1197 cgroups_isolator.cpp:939] Updated > 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 925000 for executor > executor_Task_Tracker_129 of framework > 201306122129-1707151626-5050-5724-0000 > I0613 00:11:26.227856 1197 cgroups_isolator.cpp:864] Updated 'cpu.shares' > to 9472 for executor executor_Task_Tracker_129 of framework > 201306122129-1707151626-5050-5724-0000 > I0613 00:11:26.228564 1197 cgroups_isolator.cpp:1002] Updated > 'memory.limit_in_bytes' to 22229811200 for executor > executor_Task_Tracker_129 of framework > 201306122129-1707151626-5050-5724-0000 > I0613 00:11:27.729706 1210 status_update_manager.cpp:290] Received status > update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for task > Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 with > checkpoint=false > I0613 00:11:27.730159 1210 status_update_manager.cpp:450] Creating > StatusUpdate stream for task Task_Tracker_129 of framework > 201306122129-1707151626-5050-5724-0000 > I0613 00:11:27.730520 1210 status_update_manager.cpp:336] Forwarding > status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for > task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 > to master@10.17.193.101:5050 > I0613 00:11:27.731010 1211 slave.cpp:1821] Sending acknowledgement for > status update TASK_RUNNING (UUID: e984eac5-a9d7-4546-a7e9-444f5f018cab) for > task Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 > to executor(1)@10.17.178.97:58109 > I0613 00:11:27.735405 1211 status_update_manager.cpp:360] Received status > update acknowledgement e984eac5-a9d7-4546-a7e9-444f5f018cab for task > Task_Tracker_129 of framework 201306122129-1707151626-5050-5724-0000 > I0613 00:11:47.578543 1202 slave.cpp:2514] Current usage 16.54%. Max > allowed age: 5.142519016183785days > I0613 00:12:47.579776 1202 slave.cpp:2514] Current usage 16.90%. Max > allowed age: 5.117054416194364days > I0613 00:13:47.581866 1202 slave.cpp:2514] Current usage 17.29%. Max > allowed age: 5.089524142974595days > I0613 00:14:47.585180 1209 slave.cpp:2514] Current usage 17.61%. Max > allowed age: 5.067278719463831days > I0613 00:15:47.585908 1206 slave.cpp:2514] Current usage 17.87%. Max > allowed age: 5.049103921517002days > I0613 00:16:47.588693 1207 slave.cpp:2514] Current usage 17.97%. Max > allowed age: 5.041750330234468days > I0613 00:17:28.162240 1212 slave.cpp:1896] master@10.17.193.101:5050exited > W0613 00:17:28.162505 1212 slave.cpp:1899] Master disconnected! Waiting > for a new master to be elected > I0613 00:17:38.006042 1201 detector.cpp:420] Master detector (slave(1)@ > 10.17.178.97:5051) found 1 registered masters > I0613 00:17:38.008129 1201 detector.cpp:467] Master detector (slave(1)@ > 10.17.178.97:5051) got new master pid: master@10.17.184.87:5050 > I0613 00:17:38.008599 1201 slave.cpp:537] New master detected at > master@10.17.184.87:5050 > I0613 00:17:38.009098 1198 status_update_manager.cpp:155] New master > detected at master@10.17.184.87:5050 > I0613 00:17:39.059538 1203 slave.cpp:633] Re-registered with master > master@10.17.184.87:5050 > I0613 00:17:39.059859 1203 slave.cpp:1294] Updating framework > 201306122129-1707151626-5050-5724-0000 pid to scheduler(1)@ > 10.17.184.87:57804 > I0613 00:17:46.057116 1202 detector.cpp:420] Master detector (slave(1)@ > 10.17.178.97:5051) found 2 registered masters > I0613 00:17:47.590699 1200 slave.cpp:2514] Current usage 18.06%. Max > allowed age: 5.036055129997118days > I0613 00:18:47.592268 1201 slave.cpp:2514] Current usage 17.98%. Max > allowed age: 5.041545235716157days > I0613 00:19:47.596873 1204 slave.cpp:2514] Current usage 17.86%. Max > allowed age: 5.049504905524051days > I0613 00:20:47.597520 1208 slave.cpp:2514] Current usage 17.86%. Max > allowed age: 5.049908279608947days > I0613 00:21:47.598794 1206 slave.cpp:2514] Current usage 17.55%. Max > allowed age: 5.071801618813565days > I0613 00:22:47.599805 1202 slave.cpp:2514] Current usage 17.56%. Max > allowed age: 5.070852822503368days > I0613 00:23:47.604342 1199 slave.cpp:2514] Current usage 17.56%. Max > allowed age: 5.070920390650185days > I0613 00:24:47.605106 1203 slave.cpp:2514] Current usage 17.56%. Max > allowed age: 5.070920003070000days > *** Aborted at 1371083126 (unix time) try "date -d @1371083126" if you are > using GNU date *** > PC: @ 0x7f53c240dd84 __pthread_cond_wait > *** SIGTERM (@0x409) received by PID 1187 (TID 0x7f53c37b8740) from PID > 1033; stack trace: *** > @ 0x7f53c2411cb0 (unknown) > @ 0x7f53c240dd84 __pthread_cond_wait > @ 0x7f53c3088f03 (unknown) > @ 0x7f53c308961f (unknown) > @ 0x40c75a (unknown) > @ 0x7f53c206476d (unknown) > @ 0x40d511 (unknown) > I0613 00:25:26.600497 9946 main.cpp:119] Creating "cgroups" isolator > I0613 00:25:26.622987 9946 main.cpp:127] Build: 2013-05-09 22:53:54 by > I0613 00:25:26.623016 9946 main.cpp:128] Starting Mesos slave --047d7b2e12db48d9d604defe6502--