aurora-dev mailing list archives

From Shyam Patel <sham.pate...@gmail.com>
Subject Re: Aurora performance impact with hourly query runs
Date Sun, 12 Jun 2016 16:48:27 GMT
The query performance improved drastically: it took only 29 ms for 12K jobs/30K tasks
(down from an hour!).

Thanks, Maxim, for the quick lead. I really appreciate your help.



Thanks,
Sham

> On Jun 9, 2016, at 10:06 AM, Maxim Khutornenko <maxim@apache.org> wrote:
> 
> The scheduler persists its state in the Mesos replicated log regardless of
> the in-memory engine. If you change the flag and restart the scheduler, all
> tasks are going to be re-inserted into MemTaskStore instead of
> DBTaskStore. No data will be lost.
> 
> On Thu, Jun 9, 2016 at 9:55 AM, Shyam Patel <sham.patel04@gmail.com> wrote:
>> Thanks Maxim,
>> 
>> If we move to the in-memory task store, would a restart of Aurora lose the data?
>> (By the way, I’m running Aurora in a container.)
>> 
>> 
>> 
>>> On Jun 9, 2016, at 8:37 AM, Maxim Khutornenko <maxim@apache.org> wrote:
>>> 
>>> There are plenty of factors that may contribute towards the behavior
>>> you're observing. Based on the logs though it appears you are using
>>> DBTaskStore (-use_beta_db_task_store=true)? If so, you may want to
>>> revert to the default in-mem task store
>>> (-use_beta_db_task_store=false) as DBTaskStore is known to perform
>>> subpar on large task counts. This is a known issue and we plan to
>>> invest into making it faster.
>>> 
>>> On Thu, Jun 9, 2016 at 6:58 AM, Erb, Stephan
>>> <Stephan.Erb@blue-yonder.com> wrote:
>>>> I am no expert here, but I would assume that slow task store operations could
>>>> result from a slow replicated log. Have you tried keeping it on an SSD?
>>>> (https://github.com/apache/aurora/blob/e89521f1eebd9a5301eb02e2ed6ffebdecd54c9a/docs/operations/configuration.md#-native_log_file_path)
>>>> 
>>>> FWIW, there was a recent RB by Maxim to reduce Master load under task reconciliation:
>>>> https://reviews.apache.org/r/47373/diff/2#index_header
>>>> ________________________________________
>>>> From: Shyam Patel <sham.patel04@gmail.com>
>>>> Sent: Thursday, June 9, 2016 07:48
>>>> To: dev@aurora.apache.org
>>>> Subject: Re: Aurora performance impact with hourly query runs
>>>> 
>>>> Hi Bill,
>>>> 
>>>> Cluster setup: AWS
>>>> 
>>>> 1 Mesos, 1 ZK, 1 Aurora instance: 4 CPU, 16G mem
>>>> 
>>>> Aurora: Xmx 14G
>>>> 
>>>> 100-node agent cluster: 40 CPU, 160G mem each
>>>> 
>>>> 8000 jobs, each with 2 instances. So, ~16K containers in total.
>>>> 
>>>> 
>>>> Thanks,
>>>> Sham
>>>> 
>>>> 
>>>> 
>>>>> On Jun 8, 2016, at 9:18 PM, Bill Farner <wfarner@apache.org> wrote:
>>>>> 
>>>>> Can you give some insight into the machine specs and JVM options used?
>>>>> 
>>>>> Also, is it 8000 jobs or tasks?  The terms are often mixed up, but the
>>>>> distinction makes a big difference here.
>>>>> 
>>>>> On Wednesday, June 8, 2016, Shyam Patel <sham.patel04@gmail.com> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> While running LnP testing, I’m spinning up 8K Docker jobs. During the run,
>>>>>> I ran into an issue where the TaskStatUpdater and TaskReconciler queries take
>>>>>> a really long time. During that time, Aurora pretty much freezes and at some
>>>>>> point dies.  I also tried the same run without the Docker jobs and faced the
>>>>>> same issue.
>>>>>> 
>>>>>> 
>>>>>> Is there a way to keep Aurora's performance intact during the query runs?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Here is a snippet from the log:
>>>>>> 
>>>>>> 
>>>>>> I0602 00:53:37.527 [TaskStatUpdaterService RUNNING, DbTaskStore:104] Query
>>>>>> took 1243517 ms: TaskQuery(owner:null, role:null, environment:null,
>>>>>> jobName:null, taskIds:null, statuses:[STARTING, THROTTLED, RUNNING,
>>>>>> DRAINING, ASSIGNED, KILLING, RESTARTING, PENDING, PREEMPTING],
>>>>>> instanceIds:null, slaveHosts:null, jobKeys:null, offset:0, limit:0)
>>>>>> 
>>>>>> 
>>>>>> I0602 00:56:54.180 [TaskReconciler-0, DbTaskStore:104] Query took 1380169
>>>>>> ms: TaskQuery(owner:null, role:null, environment:null, jobName:null,
>>>>>> taskIds:null, statuses:[STARTING, RUNNING, DRAINING, ASSIGNED, KILLING,
>>>>>> RESTARTING, PREEMPTING], instanceIds:null, slaveHosts:null, jobKeys:null,
>>>>>> offset:0, limit:0)
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I'd appreciate any insights.
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> Sham
>>>>>> 
>>>>>> 
>> 
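
For anyone who wants to spot similar stalls in their own scheduler logs, here is a minimal Python sketch that pulls the "Query took N ms" durations out of DbTaskStore lines like the two quoted in this thread and converts them to minutes. The regular expression is inferred only from those two samples, and the names slow_queries, PATTERN, SAMPLE and the default threshold are illustrative assumptions, not part of Aurora.

```python
import re

# Assumed line shape, based only on the two log lines quoted in this thread:
#   ... [<thread name>, DbTaskStore:<line>] Query took <N> ms: TaskQuery(...)
PATTERN = re.compile(
    r"\[([^\],]+),\s*DbTaskStore:\d+\]\s+Query\s+took\s+(\d+)\s+ms"
)


def slow_queries(log_text, threshold_ms=1000):
    """Return (thread, milliseconds, minutes) for queries at or above threshold_ms."""
    # Mail clients wrap long lines and prefix them with '>' quote markers,
    # so strip the markers and join everything before matching.
    lines = [re.sub(r"^\s*(?:>\s*)+", "", line) for line in log_text.splitlines()]
    flat = " ".join(lines)

    results = []
    for thread, ms_text in PATTERN.findall(flat):
        ms = int(ms_text)
        if ms >= threshold_ms:
            results.append((thread, ms, round(ms / 60000, 1)))
    return results


# The two entries from the thread, with the TaskQuery arguments left out.
SAMPLE = """\
>>>>>> I0602 00:53:37.527 [TaskStatUpdaterService RUNNING, DbTaskStore:104] Query
>>>>>> took 1243517 ms: TaskQuery(
>>>>>> I0602 00:56:54.180 [TaskReconciler-0, DbTaskStore:104] Query took 1380169
>>>>>> ms: TaskQuery(
"""
```

Calling slow_queries(SAMPLE) reports the two queries at about 20.7 and 23.0 minutes respectively, which puts the 29 ms figure seen after switching task stores in perspective.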

