Date: Thu, 26 Jan 2017 18:37:24 +0000 (UTC)
From: "Neil Conway (JIRA)"
To: issues@mesos.apache.org
Subject: [jira] [Assigned] (MESOS-7008) Incomplete recovery of roles leading to fatal CHECK failure

    [ https://issues.apache.org/jira/browse/MESOS-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neil Conway reassigned MESOS-7008:
----------------------------------

    Assignee: Neil Conway

> Incomplete recovery of roles leading to fatal CHECK failure
> -----------------------------------------------------------
>
>                 Key: MESOS-7008
>                 URL: https://issues.apache.org/jira/browse/MESOS-7008
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>     Environment: OS X, SSL build
>        Reporter: Benjamin Bannier
>        Assignee: Neil Conway
>          Labels: quota, roles
>
> When a quota is set and the master is then restarted, removing the quota reliably leads to a {{CHECK}} failure for me.
> Start a master:
> {code}
> $ mesos-master --work_dir=work_dir
> {code}
> Set a quota. This creates an implicit role.
> {code}
> $ cat quota.json
> {
>   "role": "role2",
>   "force": true,
>   "guarantee": [
>     {
>       "name": "cpus",
>       "type": "SCALAR",
>       "scalar": { "value": 1 }
>     }
>   ]
> }
> $ cat quota.json | http POST :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:33:38 GMT
> $ http GET :5050/quota
> HTTP/1.1 200 OK
> Content-Length: 108
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:33:56 GMT
> {
>     "infos": [
>         {
>             "guarantee": [
>                 {
>                     "name": "cpus",
>                     "role": "*",
>                     "scalar": {
>                         "value": 1.0
>                     },
>                     "type": "SCALAR"
>                 }
>             ],
>             "role": "role2"
>         }
>     ]
> }
> $ http GET :5050/roles
> HTTP/1.1 200 OK
> Content-Length: 106
> Content-Type: application/json
> Date: Thu, 26 Jan 2017 12:34:10 GMT
> {
>     "roles": [
>         {
>             "frameworks": [],
>             "name": "role2",
>             "resources": {
>                 "cpus": 0,
>                 "disk": 0,
>                 "gpus": 0,
>                 "mem": 0
>             },
>             "weight": 1.0
>         }
>     ]
> }
> {code}
> Restart the master process using the same {{work_dir}} and attempt to delete the quota after the master is started. The {{DELETE}} succeeds with an {{OK}}.
> {code}
> $ http DELETE :5050/quota/role2
> HTTP/1.1 200 OK
> Content-Length: 0
> Date: Thu, 26 Jan 2017 12:36:04 GMT
> {code}
> After handling the request, the master hits a {{CHECK}} failure and aborts.
> {code}
> $ mesos-master --work_dir=work_dir
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0126 13:34:57.528599 3145483200 main.cpp:278] Build: 2017-01-23 07:57:34 by bbannier
> I0126 13:34:57.529131 3145483200 main.cpp:279] Version: 1.2.0
> I0126 13:34:57.529139 3145483200 main.cpp:286] Git SHA: dd07d025d40975ec660ed17031d95ec0dba842d2
> [warn] kq_init: detected broken kqueue; not using.: No such process
> I0126 13:34:57.758896 3145483200 main.cpp:385] Using 'HierarchicalDRF' allocator
> I0126 13:34:57.764276 3145483200 replica.cpp:778] Replica recovered with log positions 3 -> 4 with 0 holes and 0 unlearned
> I0126 13:34:57.765278 256114688 recover.cpp:451] Starting replica recovery
> I0126 13:34:57.765547 256114688 recover.cpp:477] Replica is in VOTING status
> I0126 13:34:57.795964 257187840 master.cpp:383] Master 569073cc-1195-45e9-b0d4-e2e1bf0d13d5 (172.18.9.56) started on 172.18.9.56:5050
> I0126 13:34:57.796023 257187840 master.cpp:385] Flags at startup: --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="false" --authenticate_frameworks="false" --authenticate_http_frameworks="false" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --quiet="false" --recovery_agent_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="20secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="work_dir" --zk_session_timeout="10secs"
> I0126 13:34:57.796478 257187840 master.cpp:437] Master allowing unauthenticated frameworks to register
> I0126 13:34:57.796507 257187840 master.cpp:451] Master allowing unauthenticated agents to register
> I0126 13:34:57.796517 257187840 master.cpp:465] Master allowing HTTP frameworks to register without authentication
> I0126 13:34:57.796540 257187840 master.cpp:507] Using default 'crammd5' authenticator
> W0126 13:34:57.796573 257187840 authenticator.cpp:512] No credentials provided, authentication requests will be refused
> I0126 13:34:57.796584 257187840 authenticator.cpp:519] Initializing server SASL
> I0126 13:34:57.825337 255578112 master.cpp:2121] Elected as the leading master!
> I0126 13:34:57.825362 255578112 master.cpp:1643] Recovering from registrar
> I0126 13:34:57.825736 255578112 log.cpp:553] Attempting to start the writer
> I0126 13:34:57.826889 258260992 replica.cpp:495] Replica received implicit promise request from __req_res__(1)@172.18.9.56:5050 with proposal 2
> I0126 13:34:57.828855 258260992 replica.cpp:344] Persisted promised to 2
> I0126 13:34:57.829273 258260992 coordinator.cpp:238] Coordinator attempting to fill missing positions
> I0126 13:34:57.829375 259334144 log.cpp:569] Writer started with ending position 4
> I0126 13:34:57.830878 257187840 registrar.cpp:362] Successfully fetched the registry (159B) in 5.427968ms
> I0126 13:34:57.831029 257187840 registrar.cpp:461] Applied 1 operations in 24us; attempting to update the registry
> I0126 13:34:57.836194 259334144 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 5
> I0126 13:34:57.836676 257724416 replica.cpp:539] Replica received write request for position 5 from __req_res__(2)@172.18.9.56:5050
> I0126 13:34:57.837102 255578112 replica.cpp:693] Replica received learned notice for position 5 from @0.0.0.0:0
> I0126 13:34:57.837745 257187840 registrar.cpp:506] Successfully updated the registry in 6.685184ms
> I0126 13:34:57.837806 257187840 registrar.cpp:392] Successfully recovered registrar
> I0126 13:34:57.837924 255578112 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 6
> I0126 13:34:57.838132 256651264 master.cpp:1759] Recovered 0 agents from the registry (159B); allowing 10mins for agents to re-register
> I0126 13:34:57.838312 257187840 replica.cpp:539] Replica received write request for position 6 from __req_res__(3)@172.18.9.56:5050
> I0126 13:34:57.838692 256651264 replica.cpp:693] Replica received learned notice for position 6 from @0.0.0.0:0
> I0126 13:36:04.887257 256114688 http.cpp:420] HTTP DELETE for /master/quota/role2 from 127.0.0.1:51458 with User-Agent='HTTPie/0.9.8'
> I0126 13:36:04.887512 255578112 registrar.cpp:461] Applied 1 operations in 42us; attempting to update the registry
> I0126 13:36:04.892643 255578112 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 7
> I0126 13:36:04.893127 258797568 replica.cpp:539] Replica received write request for position 7 from __req_res__(4)@172.18.9.56:5050
> I0126 13:36:04.895309 257187840 replica.cpp:693] Replica received learned notice for position 7 from @0.0.0.0:0
> I0126 13:36:04.895814 258260992 registrar.cpp:506] Successfully updated the registry in 8.2688ms
> F0126 13:36:04.895956 256114688 hierarchical.cpp:1180] Check failed: quotas.contains(role)
> *** Check failure stack trace: ***
> I0126 13:36:04.895961 255578112 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 8
> I0126 13:36:04.896437 257187840 replica.cpp:539] Replica received write request for position 8 from __req_res__(5)@172.18.9.56:5050
> I0126 13:36:04.896908 259334144 replica.cpp:693] Replica received learned notice for position 8 from @0.0.0.0:0
>     @     0x10b5e52aa  google::LogMessage::Fail()
> E0126 13:36:04.905042 259870720 process.cpp:2419] Failed to shutdown socket with fd 11: Socket is not connected
>     @     0x10b5e282c  google::LogMessage::SendToLog()
>     @     0x10b5e3959  google::LogMessage::Flush()
>     @     0x10b5ee159  google::LogMessageFatal::~LogMessageFatal()
>     @     0x10b5e5795  google::LogMessageFatal::~LogMessageFatal()
>     @     0x1089e8d17  mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeQuota()
>     @     0x107ebbc13  _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSt3__112basic_stringIcNS6_11char_traitsIcEENS6_9allocatorIcEEEESC_EEvRKNS_3PIDIT_EEMSG_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESP_
>     @     0x107ebbab0  _ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEESF_EEvRKNS3_3PIDIT_EEMSJ_FvT0_ET1_EUlPNS3_11ProcessBaseEE_SS_EEEvDpOT_
>     @     0x107ebb7b9  _ZNSt3__110__function6__funcIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEESE_EEvRKNS2_3PIDIT_EEMSI_FvT0_ET1_EUlPNS2_11ProcessBaseEE_NSC_ISS_EEFvSR_EEclEOSR_
>     @     0x10b38ba27  std::__1::function<>::operator()()
>     @     0x10b38b96c  process::ProcessBase::visit()
>     @     0x10b40415e  process::DispatchEvent::visit()
>     @     0x107665171  process::ProcessBase::serve()
>     @     0x10b385c07  process::ProcessManager::resume()
>     @     0x10b47db90  process::ProcessManager::init_threads()::$_0::operator()()
>     @     0x10b47d7e0  _ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_EEEEZN7process14ProcessManager12init_threadsEvE3$_0EEEEEPvSB_
>     @     0x7fffb2b8eaab  _pthread_body
>     @     0x7fffb2b8e9f7  _pthread_start
>     @     0x7fffb2b8e1fd  thread_start
> [2]    59343 abort      mesos-master --work_dir=work_dir
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)