Date: Thu, 29 Sep 2016 00:04:21 +0000 (UTC)
From: "Avinash Sridharan (JIRA)"
To: issues@mesos.apache.org
Subject: [jira] [Created] (MESOS-6270) Agent crashes when trying to recover pods.

Avinash Sridharan created MESOS-6270:
----------------------------------------
Summary: Agent crashes when trying to recover pods.
Key: MESOS-6270
URL: https://issues.apache.org/jira/browse/MESOS-6270
Project: Mesos
Issue Type: Bug
Components: containerization
Affects Versions: 1.1.0
Reporter: Avinash Sridharan
Assignee: Gilbert Song

The `MesosContainerizer` crashes when it tries to recover pods after an agent restart. It appears the containerizer cannot find the parent of a nested container that it needs to destroy during recovery.

```
vagrant@mesos-dev:~/mesosphere/mesos/build$ startagentcni
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0928 23:46:49.638267 31124 main.cpp:243] Build: 2016-09-05 16:37:12 by vagrant
I0928 23:46:49.639432 31124 main.cpp:244] Version: 1.1.0
I0928 23:46:49.639801 31124 main.cpp:251] Git SHA: 97e48e4af0bf497cfe148bbeba2e2eace1a030d3
I0928 23:46:49.642931 31124 process.cpp:1069] libprocess is initialized on 10.0.2.15:5051 with 8 worker threads
I0928 23:46:49.645144 31124 logging.cpp:199] Logging to STDERR
I0928 23:46:49.649516 31124 containerizer.cpp:226] Using isolation: filesystem/linux,docker/runtime,network/cni,volume/image
I0928 23:46:49.654670 31124 linux_launcher.cpp:144] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
E0928 23:46:49.662094 31124 shell.hpp:110] Command 'hadoop version 2>&1' failed; this is the output: sh: 1: hadoop: not found
I0928 23:46:49.662240 31124 fetcher.cpp:69] Skipping URI fetcher plugin 'hadoop' as it could not be created: Failed to create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was either not found or exited with a non-zero exit status: 127
I0928 23:46:49.662992 31124 registry_puller.cpp:111] Creating registry puller with docker registry 'https://registry-1.docker.io'
I0928 23:46:49.676386 31142 slave.cpp:208] Mesos agent started on (1)@10.0.2.15:5051
I0928 23:46:49.676834 31142 slave.cpp:209] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false"
--authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_providers="docker" --image_provisioner_backend="copy" --initialize_driver_logging="true" --ip="10.0.2.15" --isolation="filesystem/linux,docker/runtime" --launcher="linux" --launcher_dir="/home/vagrant/mesosphere/mesos/build/src" --logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" --network_cni_config_dir="/home/vagrant/cni/config" --network_cni_plugins_dir="/home/vagrant/dev/go/cni/bin" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system"
--version="false" --work_dir="/var/lib/mesos"
I0928 23:46:49.679481 31142 slave.cpp:533] Agent resources: cpus(*):4; mem(*):6961; disk(*):35164; ports(*):[31000-32000]
I0928 23:46:49.679832 31142 slave.cpp:541] Agent attributes: [ ]
I0928 23:46:49.680160 31142 slave.cpp:546] Agent hostname: mesos-dev
I0928 23:46:49.685725 31147 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
I0928 23:46:49.687319 31147 fetcher.cpp:86] Clearing fetcher cache
I0928 23:46:49.687815 31147 status_update_manager.cpp:203] Recovering status update manager
I0928 23:46:49.688199 31147 containerizer.cpp:580] Recovering containerizer
I0928 23:46:49.692178 31143 linux_launcher.cpp:288] Recovered container 9e94e2b0-c07d-4641-9b65-f77b1601c696.7222dd27-576c-4f0e-afa3-17ae9f93539c
I0928 23:46:49.692248 31143 linux_launcher.cpp:288] Recovered container 9e94e2b0-c07d-4641-9b65-f77b1601c696.39caa07f-6dd2-4acd-87be-d400467890d3
I0928 23:46:49.692276 31143 linux_launcher.cpp:288] Recovered container 9e94e2b0-c07d-4641-9b65-f77b1601c696
I0928 23:46:49.692312 31143 linux_launcher.cpp:370] 9e94e2b0-c07d-4641-9b65-f77b1601c696 is a known orphaned container
I0928 23:46:49.692335 31143 linux_launcher.cpp:370] 9e94e2b0-c07d-4641-9b65-f77b1601c696.39caa07f-6dd2-4acd-87be-d400467890d3 is a known orphaned container
I0928 23:46:49.692355 31143 linux_launcher.cpp:370] 9e94e2b0-c07d-4641-9b65-f77b1601c696.7222dd27-576c-4f0e-afa3-17ae9f93539c is a known orphaned container
I0928 23:46:49.694280 31147 containerizer.cpp:2147] Container 9e94e2b0-c07d-4641-9b65-f77b1601c696 has exited
I0928 23:46:49.694339 31147 containerizer.cpp:1835] Destroying container 9e94e2b0-c07d-4641-9b65-f77b1601c696
I0928 23:46:49.695274 31145 linux_launcher.cpp:495] Asked to destroy container 9e94e2b0-c07d-4641-9b65-f77b1601c696
I0928 23:46:49.696429 31146 metadata_manager.cpp:205] No images to load from disk. Docker provisioner image storage path '/tmp/mesos/store/docker/storedImages' does not exist
I0928 23:46:49.697399 31147 provisioner.cpp:253] Provisioner recovery complete
F0928 23:46:49.697675 31142 containerizer.cpp:867] Check failed: containers_.contains(containerId.parent())
*** Check failure stack trace: ***
    @     0x7f2e3bdd514d  google::LogMessage::Fail()
    @     0x7f2e3bdd452e  google::LogMessage::SendToLog()
    @     0x7f2e3bdd4e0d  google::LogMessage::Flush()
    @     0x7f2e3bdd8288  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f2e3b0a7562  mesos::internal::slave::MesosContainerizerProcess::__recover()
    @     0x7f2e3b12dee8  _ZZN7process8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKSt4listINS2_5slave14ContainerStateESaIS8_EERK7hashsetINS2_11ContainerIDESt4hashISE_ESt8equal_toISE_EESA_SJ_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSO_T1_T2_ET3_T4_ENKUlPNS_11ProcessBaseEE_clES11_
    @     0x7f2e3b12da82  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKSt4listINS6_5slave14ContainerStateESaISC_EERK7hashsetINS6_11ContainerIDESt4hashISI_ESt8equal_toISI_EESE_SN_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSU_FSS_T1_T2_ET3_T4_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
    @     0x7f2e3bd4cc68  std::function<>::operator()()
    @     0x7f2e3bd34054  process::ProcessBase::visit()
    @     0x7f2e3bd90b2e  process::DispatchEvent::visit()
    @     0x7f2e3a4f34c1  process::ProcessBase::serve()
    @     0x7f2e3bd31d54  process::ProcessManager::resume()
    @     0x7f2e3bd3cd0c  process::ProcessManager::init_threads()::$_1::operator()()
    @     0x7f2e3bd3cc15  _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_1vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
    @     0x7f2e3bd3cbe5  std::_Bind_simple<>::operator()()
    @     0x7f2e3bd3cbbc  std::thread::_Impl<>::_M_run()
    @     0x7f2e3765ca60  (unknown)
    @     0x7f2e3717f182  start_thread
    @     0x7f2e36eac47d  (unknown)
```

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)