mesos-dev mailing list archives

From Meghdoot bhattacharya <meghdoo...@yahoo.com.INVALID>
Subject Re: Implicit reconcile "pauses" offer stream in large cluster
Date Sat, 30 Dec 2017 16:15:01 GMT
Zhitao, any further updates on this?

Thx

> On Dec 13, 2017, at 1:02 PM, Benjamin Mahler <bmahler@apache.org> wrote:
> 
> You can check the diff, for example:
> https://github.com/apache/mesos/compare/1.3.0...1.4.0
> 
> I didn't notice any changes that look like they would cause this.
> 
> What do the master logs show during the time frame?
> Have you profiled what the master and scheduler are doing during this time
> frame?
> 
>> On Tue, Dec 12, 2017 at 10:46 AM, Zhitao Li <zhitaoli.cs@gmail.com> wrote:
>> 
>> Hi,
>> 
>> We have seen some potential problems when trying to upgrade Mesos from
>> 1.3 to 1.4: when an implicit reconciliation happened for a large framework
>> (Aurora), the scheduler would not see any offers for several minutes.
>> Strangely, this does not show up once we revert to 1.3.
>> 
>> A couple of questions:
>> 
>> 1) Is there any change between 1.3 and 1.4 which can make this slower?
>> 2) FWICT from reading the implicit reconcile code, the Mesos master sends
>> back status for all active and pending tasks for the framework (70k+ in
>> our cluster right now) in one batch before yielding to any other messages.
>> Has anyone thought about supporting some kind of "pagination", i.e., the
>> master would only send back N status updates, then delay for S seconds, then
>> send back the next batch of N updates, until all active tasks are handled?
>> This is pretty much how Aurora triggers explicit reconcile to Mesos, and we
>> don't see any issue when processing it this way.
>> 
>> Thanks!
>> 
>> 
>> --
>> Cheers,
>> 
>> Zhitao Li
>> 
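
The "pagination" idea in question 2 of the original message (send N status
updates, wait S seconds, send the next N) can be illustrated with a minimal
sketch. This is not Mesos or Aurora code: reconcile_in_batches, send_batch,
batch_size and delay_seconds are hypothetical names chosen for illustration,
and the sketch only shows the batching and pacing logic, assuming the caller
supplies whatever function actually delivers a batch.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(task_ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size task IDs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(task_ids), batch_size):
        yield list(task_ids[start:start + batch_size])


def reconcile_in_batches(
    task_ids: Sequence[str],
    batch_size: int,                           # "N" in the message above
    delay_seconds: float,                      # "S" in the message above
    send_batch: Callable[[List[str]], None],   # hypothetical delivery hook
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send task IDs in batches, pausing between batches.

    Returns the number of batches sent. The pause happens only between
    batches, so other work can be interleaved while waiting.
    """
    sent = 0
    for index, batch in enumerate(batches(task_ids, batch_size)):
        if index > 0:
            sleep(delay_seconds)  # give other messages a chance to be handled
        send_batch(batch)
        sent += 1
    return sent
```

With 70k tasks and, say, N = 1000, this produces 70 batches separated by 69
pauses, so the total time is dominated by the choice of S. The sleep
parameter is injectable only so the pacing can be tested without waiting.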
