mesos-issues mailing list archives

From "Benno Evers (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MESOS-9177) Mesos master segfaults when responding to /state requests.
Date Wed, 22 Aug 2018 19:30:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589281#comment-16589281 ]

Benno Evers edited comment on MESOS-9177 at 8/22/18 7:29 PM:
-------------------------------------------------------------

As a preliminary update, I managed to narrow down the location of the segfault to this lambda
inside the FullFrameworkWriter:

{code}
      foreach (const Owned<Task>& task, framework_->completedTasks) {
        // Skip unauthorized tasks.
        if (!approvers_->approved<VIEW_TASK>(*task, framework_->info)) {
          continue;
        }

        writer->element(*task);
      }
{code}

or, more precisely, to this instruction:

{code}
# _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_ + 0x203
1d0b913:       48 8b 51 08             mov    0x8(%rcx),%rdx
{code}

Since the Mesos cluster where this segfault was observed runs with a non-standard (and quite
low) value of `--max_completed_tasks_per_framework=20`, I tried reproducing the crash by starting
a mesos-master built from the same commit locally, using the `no-executor-framework` to run
many tasks, and repeatedly hitting the `/state` endpoint on this master. While I was able to
overload the JSON renderer of my web browser, I didn't manage to reproduce the crash.
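
For context on why such a low limit could be relevant: `completedTasks` is a bounded `boost::circular_buffer`, and as far as I can tell its capacity is what `--max_completed_tasks_per_framework` controls. With a limit of 20 the buffer is full after just 20 completed tasks, and every further completion overwrites the oldest entry in place. A standalone sketch (not Mesos code) of that behavior:

{code}
// Standalone sketch of the relevant boost::circular_buffer behavior: once
// the buffer reaches its capacity -- here 20, mirroring
// --max_completed_tasks_per_framework=20 -- every push_back overwrites the
// oldest element in place instead of growing the container.
#include <boost/circular_buffer.hpp>
#include <iostream>

int main()
{
  boost::circular_buffer<int> completedTasks(20);

  for (int task = 0; task < 25; ++task) {
    completedTasks.push_back(task);  // Tasks 0..4 get evicted by 20..24.
  }

  std::cout << "size: " << completedTasks.size()       // 20
            << ", oldest: " << completedTasks.front()  // 5
            << ", newest: " << completedTasks.back()   // 24
            << std::endl;

  return 0;
}
{code}

With such a small capacity the buffer is effectively always full, so these in-place overwrites happen on essentially every task completion.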

Next, I turned to reverse-engineering the exact location of the crash, which seems to happen
while incrementing a `boost::circular_buffer` iterator (the container backing `Master::Framework::completedTasks`).
This indicates that we're probably pushing values into this container while simultaneously
iterating over it in another thread.
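
To make the suspected failure mode concrete, here is another standalone, deliberately single-threaded sketch (again not Mesos code): a `push_back` into a full `boost::circular_buffer` invalidates an iterator pointing at the overwritten element, so if such a push raced with the writer's loop on another thread, incrementing that iterator could easily fault.

{code}
// Standalone, single-threaded sketch of the suspected failure mode: an
// iterator obtained before a push_back into a *full* boost::circular_buffer
// is invalidated, because the push overwrites the oldest element in place.
// In the master this would only matter if the push happened concurrently
// with the FullFrameworkWriter iteration -- the scenario hypothesized above.
#include <boost/circular_buffer.hpp>
#include <iostream>

int main()
{
  boost::circular_buffer<int> completedTasks(20);
  for (int task = 0; task < 20; ++task) {
    completedTasks.push_back(task);  // The buffer is now full.
  }

  boost::circular_buffer<int>::iterator it = completedTasks.begin();
  std::cout << "oldest before push: " << *it << std::endl;  // 0

  // Overwrites the element `it` points at; `it` is now invalid.
  completedTasks.push_back(20);

  // Incrementing or dereferencing `it` here would be undefined behavior --
  // the in-process analogue of the crash in the writer's loop. (Building
  // with BOOST_CB_ENABLE_DEBUG turns such misuse into an assertion.)
  return 0;
}
{code}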

However, I still haven't come up with a theory for how this could happen, or for how to induce
the crash locally, since all mutations of this container seem to happen on the Master actor and thus should
never run in parallel with the iteration.
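
For completeness, the guarantee being relied on here is the usual actor one: as long as both the serialization and every mutation of `completedTasks` are queued onto the same single-threaded actor, they can never overlap. A toy, libprocess-free sketch of that property (the `SerialExecutor` below is only a stand-in for the Master actor's serialized event loop, not real Mesos code):

{code}
// Toy illustration of the actor guarantee: if both the mutation of
// completedTasks and the iteration over it run as queued work items on one
// serial executor (standing in for the Master actor), they can never execute
// in parallel, so the loop in FullFrameworkWriter always sees a stable buffer.
#include <boost/circular_buffer.hpp>
#include <functional>
#include <iostream>
#include <queue>

class SerialExecutor
{
public:
  void dispatch(std::function<void()> work) { queue_.push(std::move(work)); }

  void run()
  {
    while (!queue_.empty()) {
      queue_.front()();  // Exactly one work item executes at a time.
      queue_.pop();
    }
  }

private:
  std::queue<std::function<void()>> queue_;
};

int main()
{
  boost::circular_buffer<int> completedTasks(20);
  SerialExecutor master;

  // A "task completed" mutation and a "/state"-style read, both serialized.
  master.dispatch([&] { completedTasks.push_back(42); });
  master.dispatch([&] {
    for (int task : completedTasks) {
      std::cout << "completed task: " << task << std::endl;
    }
  });

  master.run();
  return 0;
}
{code}

Any code path that pushed into the buffer from outside this serialized context would break the guarantee, which is what the crash hints at but what I haven't been able to confirm.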


> Mesos master segfaults when responding to /state requests.
> ----------------------------------------------------------
>
>                 Key: MESOS-9177
>                 URL: https://issues.apache.org/jira/browse/MESOS-9177
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.0
>            Reporter: Alexander Rukletsov
>            Assignee: Benno Evers
>            Priority: Blocker
>              Labels: mesosphere
>
> {noformat}
>  *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; stack trace: ***
>  @     0x7f367e7226d0 (unknown)
>  @     0x7f3681266913 _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
>  @     0x7f3681266af0 _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @     0x7f36812882d0 mesos::internal::master::FullFrameworkWriter::operator()()
>  @     0x7f36812889d0 _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @     0x7f368121aef0 _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApproversEEEE_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @     0x7f3681241be3 _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApproversEEEE_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_
>  @     0x7f3681242760 _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApproversEEEE_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @     0x7f36810a41bb _ZNO4JSON5ProxycvSsEv
>  @     0x7f368215f60e process::http::OK::OK()
>  @     0x7f3681219061 _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApproversEEEE_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_
>  @     0x7f36812212c0 _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApproversEEEE_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_
>  @     0x7f36812215ac _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApproversEEEE_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEEEEEclEOS3_
>  @     0x7f36821f3541 process::ProcessBase::consume()
>  @     0x7f3682209fbc process::ProcessManager::resume()
>  @     0x7f368220fa76 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
>  @     0x7f367eefc2b0 (unknown)
>  @     0x7f367e71ae25 start_thread
>  @     0x7f367e444bad __clone
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
