Mailing-List: contact reviews-help@aurora.apache.org; run by ezmlm
Precedence: bulk
Reply-To: reviews@aurora.apache.org
Content-Type: multipart/alternative;
 boundary="===============8637582314504276925=="
MIME-Version: 1.0
Subject: Re: Review Request 43457: Increase throughput of DbTaskStore
From: Maxim Khutornenko <maxim@apache.org>
To: Maxim Khutornenko <maxim@apache.org>, John Sirois <jsirois@apache.org>
Cc: Bill Farner <wfarner@apache.org>, Aurora ReviewBot <wfarner@apache.org>,
 Zameer Manji <zmanji@apache.org>, Aurora <reviews@aurora.apache.org>
Date: Fri, 12 Feb 2016 01:09:34 -0000
Message-ID: <20160212010934.24149.10793@reviews.apache.org>
Auto-Submitted: auto-generated
Sender: Maxim Khutornenko <noreply@reviews.apache.org>
References: <20160211005719.24150.75135@reviews.apache.org>
In-Reply-To: <20160211005719.24150.75135@reviews.apache.org>
Reply-To: Maxim Khutornenko <maxim@apache.org>

--===============8637582314504276925==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit


> On Feb. 11, 2016, 12:57 a.m., Bill Farner wrote:
> > It would be nice to hear how this change jibes with the opposite change made in https://reviews.apache.org/r/42882
> 
> Maxim Khutornenko wrote:
>     I thought about that too. I think there are 2 major differences: the total number of rows generated by the multi-join select statement and the number of required subselects. In that RB, lowering the row count from 500k to just under 100, combined with the low number of required subselects, helped to unlock perf gains.
>     
>     In this particular scenario, it appears that the frequency of subselects trumps everything else.
>     
>     Zameer, what's the overall number of rows returned by the select statement for a single task in your case?
> 
> Zameer Manji wrote:
>     Running:
>     ````
>         SELECT
>           t.id AS row_id,
>           t.task_config_row_id AS task_config_row_id,
>           t.task_id AS task_id,
>           t.instance_id AS instance_id,
>           t.status AS status,
>           t.failure_count AS failure_count,
>           t.ancestor_task_id AS ancestor_id,
>           j.role AS c_j_role,
>           j.environment AS c_j_environment,
>           j.name AS c_j_name,
>           h.slave_id AS slave_id,
>           h.host AS slave_host,
>           tp.name as tp_name,
>           tp.port as tp_port,
>           te.timestamp_ms as te_timestamp,
>           te.status as te_status,
>           te.message as te_message,
>           te.scheduler_host as te_scheduler
>         FROM tasks AS t
>         INNER JOIN task_configs as c ON c.id = t.task_config_row_id
>         INNER JOIN job_keys AS j ON j.id = c.job_key_id
>         LEFT OUTER JOIN task_ports as tp ON tp.task_row_id = t.id
>         LEFT OUTER JOIN task_events as te ON te.task_row_id = t.id
>         LEFT OUTER JOIN host_attributes AS h ON h.id = t.slave_row_id
>         WHERE task_id = '1454546771388-zmanji-devel-labrat-237-0e52b4a9-a8da-4958-997f-7bbe3db6b5d2'
>     ````
>     
>     On a test cluster, this returns 4 rows when the task is in the RUNNING state.
>     
>     This makes sense: a job typically does not allocate that many ports, and a task will have fewer than 8 events.
>     
>     Further, running:
>     ````
>         SELECT
>           c.id AS id,
>           c.creator_user AS creator_user,
>           c.service AS is_service,
>           c.num_cpus AS num_cpus,
>           c.ram_mb AS ram_mb,
>           c.disk_mb AS disk_mb,
>           c.priority AS priority,
>           c.max_task_failures AS max_task_failures,
>           c.production AS production,
>           c.contact_email AS contact_email,
>           c.executor_name AS executor_name,
>           c.executor_data AS executor_data,
>           c.tier AS tier,
>           j.role AS j_role,
>           j.environment AS j_environment,
>           j.name AS j_name,
>           p.port_name AS p_port_name,
>           d.id AS c_id,
>           d.image AS c_image,
>           m.id AS m_id,
>           m.key AS m_key,
>           m.value AS m_value,
>           tc.id AS constraint_id,
>           tc.name AS constraint_name,
>           tlc.id AS constraint_l_id,
>           tlc.value AS constraint_l_limit,
>           tvc.id AS constraint_v_id,
>           tvc.negated AS constraint_v_negated,
>           tvcv.value as constraint_v_v_value
>         FROM task_configs AS c
>         INNER JOIN job_keys AS j ON j.id = c.job_key_id
>         LEFT OUTER JOIN task_config_requested_ports AS p ON p.task_config_id = c.id
>         LEFT OUTER JOIN task_config_docker_containers AS d ON d.task_config_id = c.id
>         LEFT OUTER JOIN task_config_metadata AS m ON m.task_config_id = c.id
>         LEFT OUTER JOIN task_constraints AS tc ON tc.task_config_id = c.id
>         LEFT OUTER JOIN limit_constraints as tlc ON tlc.constraint_id = tc.id
>         LEFT OUTER JOIN value_constraints as tvc ON tvc.constraint_id = tc.id
>         LEFT OUTER JOIN value_constraint_values AS tvcv ON tvcv.value_constraint_id = tvc.id
>         WHERE c.id = 1
>     ````
>     
>     returns 2 rows for a task in the above job.
>     
>     I think this is because a typical job doesn't have that many constraints.

Thanks Zameer. This confirms my assumptions about row count vs. sub-select chattiness.
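
For anyone following along, the row-count side of this can be sketched in a few lines. This is only an illustration, not code from the patch: it uses SQLite through Python's sqlite3 module rather than the scheduler's actual database, the three-table schema is a cut-down assumption based on the column names in Zameer's first query, and join_row_count is a hypothetical helper. It shows that two LEFT OUTER JOINs against independent child tables return max(1, ports) * max(1, events) rows for a single task, which stays small when both collections are small.

```python
import sqlite3


def join_row_count(num_ports: int, num_events: int) -> int:
    """Count rows returned by a two-way LEFT OUTER JOIN for one task.

    Hypothetical helper: the schema below is a simplified assumption,
    keeping only the columns needed for the join conditions.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tasks (id INTEGER PRIMARY KEY, task_id TEXT);
        CREATE TABLE task_ports (task_row_id INTEGER, name TEXT, port INTEGER);
        CREATE TABLE task_events (task_row_id INTEGER, timestamp_ms INTEGER, status TEXT);
        INSERT INTO tasks (id, task_id) VALUES (1, 'task-1');
        """
    )
    # Attach the requested number of ports and events to the single task.
    conn.executemany(
        "INSERT INTO task_ports VALUES (1, ?, ?)",
        [(f"port{i}", 31000 + i) for i in range(num_ports)],
    )
    conn.executemany(
        "INSERT INTO task_events VALUES (1, ?, ?)",
        [(i, f"STATUS_{i}") for i in range(num_events)],
    )
    # Same join shape as the task query: both child tables join on the task row id.
    (count,) = conn.execute(
        """
        SELECT COUNT(*)
        FROM tasks AS t
        LEFT OUTER JOIN task_ports AS tp ON tp.task_row_id = t.id
        LEFT OUTER JOIN task_events AS te ON te.task_row_id = t.id
        WHERE t.task_id = 'task-1'
        """
    ).fetchone()
    conn.close()
    return count


if __name__ == "__main__":
    for ports, events in [(0, 0), (1, 4), (2, 3)]:
        print(ports, events, join_row_count(ports, events))
```

The alternative, nested subselects, keeps each result set flat but issues additional queries per task for every collection, so the number of round trips grows with the number of tasks fetched. With row counts this low per task, the join is the cheaper side of that trade.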


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43457/#review118790
-----------------------------------------------------------


On Feb. 11, 2016, 8:03 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43457/
> -----------------------------------------------------------
> 
> (Updated Feb. 11, 2016, 8:03 p.m.)
> 
> 
> Review request for Aurora, John Sirois and Maxim Khutornenko.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Profiling master indicated that the bottleneck was MyBatis populating ResultSets and then populating the resulting objects. This patch removes subselects, which reduces the number of ResultSets, and it stops populating objects via a constructor, which is slower than populating them via setters.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/storage/db/views/DbAssginedPort.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/storage/db/views/DbAssignedTask.java 93722395ed9fcd22dcb12e34e648e6e410952d43 
>   src/main/java/org/apache/aurora/scheduler/storage/db/views/DbScheduledTask.java 502a1fa6fc141df498f0f09af292ce24e269731d 
>   src/main/resources/org/apache/aurora/scheduler/storage/db/TaskConfigMapper.xml b1394cf44b7ddafcbc47bb1968306d0b33293380 
>   src/main/resources/org/apache/aurora/scheduler/storage/db/TaskMapper.xml ea469cce31544221c34ae05a1c65f71271985655 
> 
> Diff: https://reviews.apache.org/r/43457/diff/
> 
> 
> Testing
> -------
> 
> Master:
> Benchmark                                      (numTasks)   Mode  Cnt   Score    Error  Units
> TaskStoreBenchmarks.DBFetchTasksBenchmark.run       10000  thrpt    5  44.052 ± 14.689  ops/s
> TaskStoreBenchmarks.DBFetchTasksBenchmark.run       50000  thrpt    5   0.179 ±  0.052  ops/s
> TaskStoreBenchmarks.DBFetchTasksBenchmark.run      100000  thrpt    5   0.087 ±  0.022  ops/s
> 
> This Patch:
> Benchmark                                      (numTasks)   Mode  Cnt   Score   Error  Units
> TaskStoreBenchmarks.DBFetchTasksBenchmark.run       10000  thrpt    5  51.531 ± 7.236  ops/s
> TaskStoreBenchmarks.DBFetchTasksBenchmark.run       50000  thrpt    5   7.370 ± 1.320  ops/s
> TaskStoreBenchmarks.DBFetchTasksBenchmark.run      100000  thrpt    5   2.143 ± 1.234  ops/s
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>

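As a quick reading aid for the benchmark numbers quoted above, here is a small sketch that computes the throughput ratio per task count. The scores are copied from the Testing section; the dictionary and function names are hypothetical. Note that the error intervals for the 10000-task case overlap (44.052 ± 14.689 vs. 51.531 ± 7.236), so that particular gain may be within noise, while the 50000 and 100000 cases improve by well over an order of magnitude.

```python
# Throughput scores (ops/s) copied from the Testing section of the review request.
MASTER = {10000: 44.052, 50000: 0.179, 100000: 0.087}
PATCH = {10000: 51.531, 50000: 7.370, 100000: 2.143}


def speedups(before: dict, after: dict) -> dict:
    """Return the after/before throughput ratio for each task count."""
    return {num_tasks: after[num_tasks] / before[num_tasks] for num_tasks in before}


if __name__ == "__main__":
    for num_tasks, ratio in speedups(MASTER, PATCH).items():
        print(f"{num_tasks:>6} tasks: {ratio:.1f}x")
```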

--===============8637582314504276925==--