kudu-user mailing list archives

From Faraz Mateen <fmat...@an10.io>
Subject Re: Inconsistent read performance with Spark
Date Wed, 13 Feb 2019 17:37:16 GMT
By "not noticing any compaction" I meant I did not see any visible change
in disk space. However, logs show that there were some compaction related
operations happening during this whole time period. These statements
appeared multiple times in tserver logs:

W0211 13:44:10.991221 15822 tablet.cc:1679] T
00b8818d0713485b83982ac56d9e342a P 7b44fc5229fe43e190d4d6c1e8022988: Can't
schedule compaction. Clean time has not been advanced past its initial
value.
...
...
I0211 14:36:33.883819 15822 maintenance_manager.cc:302] P
7b44fc5229fe43e190d4d6c1e8022988: Scheduling
MajorDeltaCompactionOp(30c9aaadcb13460fab832bdea1104349): perf
score=0.106957
I0211 14:36:33.884233 13179 diskrowset.cc:560] T
30c9aaadcb13460fab832bdea1104349 P 7b44fc5229fe43e190d4d6c1e8022988:
RowSet(3080): Major compacting REDO delta stores (cols: 2 3 4 5 6 7 9 10 11
13 14 15 16 20 22 29 31 33 36 38 39 41 42 47 49 51 52 56 57 58 64 67 68 71
75 77 78 79 80 81 109 128 137)


Does compaction affect scan performance? And if it does, what can I do to
limit this degradation?


On Wed, Feb 13, 2019 at 7:24 PM Faraz Mateen <fmateen@an10.io> wrote:

> Thanks a lot for the help, Hao.
>
> Response Inline:
>
> You can use tablet server web UI scans dashboard (/scans) to get a better
>> understanding of the ongoing/past queries. The flag 'scan_history_count' is
>> used to configure the size of the buffer. From there, you can get
>> information such as the applied predicates and column stats for the
>> selected columns.
>
>
> Thanks. I did not know about this.
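
For reference, the scan history buffer mentioned above is sized at tserver
startup. A minimal sketch of the corresponding gflags flagfile entry, where
100 is only an illustrative value:

# keep the last 100 completed scans visible on the tserver's /scans page
--scan_history_count=100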
>
> Did you notice any compactions in Kudu between when you issued the two queries?
>> What is your ingest pattern? Are you inserting data in random primary key
>> order?
>
>
> The table has hash partitioning on an ID column that can have 15 different
> values, and range partitioning on datetime, split monthly. ID and datetime
> together form the primary key. The data we ingest is (usually) in increasing
> order of time, but the order of IDs is random.
>
> However, ingestion into Kudu was stopped while I was performing these
> queries. I did not notice any compaction either.
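
As a concrete illustration of that layout, here is a minimal sketch of
creating such a table with the Kudu Java client from Scala. The master
address, table name, non-key column, and the single example month are
placeholders rather than the actual schema:

import org.apache.kudu.{ColumnSchema, Schema, Type}
import org.apache.kudu.client.{CreateTableOptions, KuduClient}
import scala.collection.JavaConverters._

val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()

// (id, datetime) composite primary key, plus one placeholder value column.
val schema = new Schema(List(
  new ColumnSchema.ColumnSchemaBuilder("id", Type.INT32).key(true).build(),
  new ColumnSchema.ColumnSchemaBuilder("datetime", Type.UNIXTIME_MICROS).key(true).build(),
  new ColumnSchema.ColumnSchemaBuilder("payload", Type.STRING).build()
).asJava)

val opts = new CreateTableOptions()
  .addHashPartitions(List("id").asJava, 3)            // 3 hash buckets on id
  .setRangePartitionColumns(List("datetime").asJava)  // monthly range partitions

// One illustrative monthly partition: [2019-02-01, 2019-03-01) UTC, in epoch micros.
val lower = schema.newPartialRow()
lower.addLong("datetime", 1548979200000000L)
val upper = schema.newPartialRow()
upper.addLong("datetime", 1551398400000000L)
opts.addRangePartition(lower, upper)

client.createTable("packets", schema, opts)
client.close()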
>
> On Wed, Feb 13, 2019 at 2:15 AM Hao Hao <hao.hao@cloudera.com> wrote:
>
>> Hi Faraz,
>>
>> Answered inline below.
>>
>> Best,
>> Hao
>>
>> On Tue, Feb 12, 2019 at 6:59 AM Faraz Mateen <fmateen@an10.io> wrote:
>>
>>> Hi all,
>>>
>>> I am using Spark to pull data from my single-node test Kudu setup and
>>> publish it to Kafka. However, my query time is not consistent.
>>>
>>> I am querying a table with around *1.1 million* packets. Initially my
>>> query was taking *537 seconds to read 51042 records* from Kudu and
>>> write them to Kafka. This rate was much lower than I had expected. I
>>> had around 45 tables with little data in them that was not needed anymore.
>>> I deleted all those tables, restarted the Spark session and attempted the
>>> same query. Now the query completed in *5.3 seconds*.
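
For context, a minimal sketch of the kind of kudu-spark read described here,
filtering on the range-partition column and projecting only the needed columns
so the connector can push work down to Kudu. The master address, table and
column names, Kafka bootstrap servers, and topic are placeholders:

import java.sql.Timestamp
import org.apache.kudu.spark.kudu._
import spark.implicits._  // assumes an existing SparkSession named `spark`

val df = spark.read
  .options(Map("kudu.master" -> "kudu-master:7051", "kudu.table" -> "packets"))
  .format("org.apache.kudu.spark.kudu")
  .load()

// Filter on the range-partition column and select only the needed columns.
val slice = df
  .filter($"datetime" >= Timestamp.valueOf("2019-02-01 00:00:00") &&
          $"datetime" <  Timestamp.valueOf("2019-03-01 00:00:00"))
  .select("id", "datetime", "payload")

// Spark 2.3 supports writing a batch DataFrame to the Kafka sink.
slice.selectExpr("CAST(payload AS STRING) AS value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("topic", "packets")
  .save()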
>>>
>>> I increased the number of rows to be fetched and tried the same query.
>>> The row count was *118741*, but it took *1861 seconds* to complete. During
>>> the query, resource utilization on my servers was very low. When I
>>> attempted the same query again after a couple of hours, it took only
>>> *16 secs*.
>>>
>>> After this I kept increasing the number of rows to be fetched, and the
>>> time kept increasing in a linear fashion.
>>>
>>> What I want to ask is:
>>>
>>>    - How can I debug why the time for these queries is varying so much?
>>>    I am not able to get anything out of the Kudu logs.
>>>
>>> You can use tablet server web UI scans dashboard (/scans) to get a
>> better understanding of the ongoing/past queries. The flag
>> 'scan_history_count' is used to configure the size of the buffer. From
>> there, you can get information such as the applied predicates and column
>> stats for the selected columns.
>>
>>
>>>
>>>    - I am running Kudu with default configurations. Are there any
>>>    tweaks I should perform to boost the performance of my setup?
>>>
>>> Did you notice any compactions in Kudu between when you issued the two
>> queries? What is your ingest pattern? Are you inserting data in random
>> primary key order?
>>
>>>
>>>    - Does having a lot of tables cause performance issues?
>>>
>>> If you are not hitting resource limitations due to writes/scans on the
>> other tables, they shouldn't affect the performance of your queries. Just
>> FYI, this is the scaling guide
>> <https://kudu.apache.org/docs/scaling_guide.html> with respect to
>> various system resources.
>>
>>>
>>>    - Will having more masters and tservers improve my query time?
>>>
>>> The master is not likely to be the bottleneck, as the client communicates
>> directly with the tserver for queries once it knows which tserver to talk
>> to. But separating the master and tserver so that they are not on the same
>> node might help. This is the scale limitations
>> <https://kudu.apache.org/docs/known_issues.html#_scale> guide for a
>> rough estimate of the number of tservers required for a given quantity of
>> data.
>>
>> *Environment Details:*
>>>
>>>    - Single-node Kudu 1.7 master and tserver. The server has 4 vCPUs and
>>>    16 GB RAM.
>>>    - The table I am querying is hash partitioned on an ID column with 3
>>>    buckets. It is also range partitioned on datetime, with a new partition
>>>    for each month.
>>>    - Kafka version 1.1.
>>>    - Standalone Spark 2.3.0 deployed on a server with 2 vCPUs and 4 GB RAM.
>>>
>>> --
>>> Faraz Mateen
>>>
>>
>
> --
> Faraz Mateen
>


-- 
Faraz Mateen
