mesos-issues mailing list archives

From "Peter Kolloch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-3744) Master crashes when tearing down framework
Date Thu, 15 Oct 2015 13:42:05 GMT

    [ https://issues.apache.org/jira/browse/MESOS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958920#comment-14958920 ]

Peter Kolloch commented on MESOS-3744:
--------------------------------------

Probably a duplicate of MESOS-3719

> Master crashes when tearing down framework
> ------------------------------------------
>
>                 Key: MESOS-3744
>                 URL: https://issues.apache.org/jira/browse/MESOS-3744
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation
>    Affects Versions: 0.23.0
>            Reporter: Peter Kolloch
>         Attachments: master-fail.log
>
>
> The crash happened shortly after calling teardown. The teardown was initiated by using httpie with:
> http -f -v POST "$MASTER_BASE_URL/teardown" "frameworkId=$FRAMEWORK"
> Below you will find the master-fail.log over the relevant time interval. Here are the last log lines before the mesos master died:
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: F1015 13:13:21.511503 23038 sorter.cpp:213] Check failed: total.resources.contains(slaveId)
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: *** Check failure stack trace: ***
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @     0x7fd1860169fd  google::LogMessage::Fail()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @     0x7fd18601889d  google::LogMessage::SendToLog()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @     0x7fd1860165ec  google::LogMessage::Flush()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @     0x7fd1860191be  google::LogMessageFatal::~LogMessageFatal()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @     0x7fd186af3ea0  mesos::internal::master::allocator::DRFSorter::remove()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @     0x7fd1869d6dec  mesos::internal::master::allocator::HierarchicalAllocatorProcess<>::removeFramework()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @     0x7fd186fbdab9  process::ProcessManager::resume()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @     0x7fd186fbddaf  process::schedule()
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @     0x7fd1852bc66c  (unknown)
> Oct 15 13:13:21 ip-10-0-4-219.us-west-2.compute.internal mesos-master[23032]: @     0x7fd184fff2ed  (unknown)
> I am not sure if it matters, but in this case multiple framework instances registered with the same framework name.
> Here is an excerpt of the startup of the affected mesos master, included because it contains the software versions in use:
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.454946 18936 logging.cpp:172] INFO level logging started!
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455173 18936 main.cpp:181] Build: 2015-09-28 19:50:01 by
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455199 18936 main.cpp:183] Version: 0.23.0
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455215 18936 main.cpp:190] Git SHA: 7d15294f46b5062c59818f4d062044ac04349dc1
> Oct 15 13:13:37 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:37.455294 18936 main.cpp:204] Using 'HierarchicalDRF' allocator
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.016752 18936 leveldb.cpp:176] Opened db in 561.344642ms
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158462 18936 leveldb.cpp:183] Compacted db in 141.288563ms
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158534 18936 leveldb.cpp:198] Created db iterator in 13783ns
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158572 18936 leveldb.cpp:204] Seeked to beginning of db in 10366ns
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158673 18936 leveldb.cpp:273] Iterated through 3 keys in the db in 78606ns
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.158733 18936 replica.cpp:744] Replica recovered with log positions 125 -> 126 with 0 holes and 0 unlearned
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@716: Client environment:host.name=ip-10-0-4-219.us-west-2.compute.internal
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@724: Client environment:os.arch=4.0.5
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@725: Client environment:os.version=#2 SMP Fri Jul 10 01:01:50 UTC 2015
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@733: Client environment:user.name=(null)
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@741: Client environment:user.home=/root
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@log_env@753: Client environment:user.dir=/
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,159:18936(0x7f052aee3700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=127.0.0.1:2181 sessionTimeout=10000 watcher=0x7f0532095480 sessionId=0 sessionPasswd=<null> context=0x7f0504001130 flags=0
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.160876 18936 main.cpp:383] Starting Mesos master
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,161:18936(0x7f052bee5700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,161:18936(0x7f052bee5700):ZOO_INFO@log_env@716: Client environment:host.name=ip-10-0-4-219.us-west-2.compute.internal
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,161:18936(0x7f0528cd3700):ZOO_INFO@check_events@1703: initiated connection to server [127.0.0.1:2181]
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.161655 18936 master.cpp:368] Master 20151015-131338-3674472458-5050-18936 (10.0.4.219) started on 10.0.4.219:5050
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.161357 18942 log.cpp:238] Attempting to join replica to ZooKeeper group
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,161:18936(0x7f052aee3700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: 2015-10-15 13:13:38,162:18936(0x7f052aee3700):ZOO_INFO@log_env@716: Client environment:host.name=ip-10-0-4-219.us-west-2.compute.internal
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.162201 18936 master.cpp:370] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --cluster="peter-p70wxd2" --framework_sorter="drf" --help="false" --hostname="10.0.4.219" --initialize_driver_logging="true" --ip="10.0.4.219" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="1" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --roles="slave_public" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/opt/mesosphere/packages/mesos--d43a8eb9946a5c1c5ec05fb21922a2fdf41775b2/share/mesos/webui" --weights="slave_public=1" --work_dir="/var/lib/mesos/master" --zk="zk://127.0.0.1:2181/mesos" --zk_session_timeout="10secs"
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.162433 18936 master.cpp:417] Master allowing unauthenticated frameworks to register
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.162454 18936 master.cpp:422] Master allowing unauthenticated slaves to register
> Oct 15 13:13:38 ip-10-0-4-219.us-west-2.compute.internal mesos-master[18936]: I1015 13:13:38.162480 18936 master.cpp:459] Using default 'crammd5' authenticator
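
For anyone trying to reproduce the report without httpie, the teardown request quoted above is a form-encoded POST with a single frameworkId field. The following is only a rough Python sketch of that same request using the standard library. The function name build_teardown_request and the example host and framework ID are hypothetical placeholders, not values from the report; what $MASTER_BASE_URL actually pointed at is not stated in the issue.

```python
import urllib.parse
import urllib.request


def build_teardown_request(master_base_url: str, framework_id: str) -> urllib.request.Request:
    """Build (but do not send) a form-encoded POST to <base>/teardown.

    Mirrors the httpie call in the report:
        http -f -v POST "$MASTER_BASE_URL/teardown" "frameworkId=$FRAMEWORK"
    """
    # httpie's -f flag sends the key=value pairs as a URL-encoded form body.
    body = urllib.parse.urlencode({"frameworkId": framework_id}).encode("ascii")
    return urllib.request.Request(
        url=master_base_url.rstrip("/") + "/teardown",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


if __name__ == "__main__":
    # Hypothetical placeholder values; substitute your own base URL and framework ID.
    req = build_teardown_request("http://master.example:5050", "example-framework-id")
    print(req.get_method(), req.full_url)
    print(req.data.decode("ascii"))
    # To actually send the request (this tears the framework down):
    #     urllib.request.urlopen(req, timeout=10)
```

The request is built separately from sending it so that the URL and body can be inspected first; tearing down a framework is destructive, so the urlopen call is left commented out.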



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
