kudu-user mailing list archives

From Hao Hao <hao....@cloudera.com>
Subject Re: Inconsistent read performance with Spark
Date Wed, 13 Feb 2019 23:31:01 GMT
Hi Faraz,

What is the order of your primary key? Is it (datetime, ID) or (ID,
datetime)?

On the contrary, I suspect your scan performance got better for the same
query because a compaction happened in between, and thus there were fewer
blocks to scan. Also, would you mind sharing a screenshot of the tablet
server web UI page from when your scans took place (to do a comparison
between the 'good' and 'bad' scans)?
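
If it helps, here is a rough, hypothetical sketch of capturing that page
before and after a run so the two can be diffed; the tserver host and the
file names are placeholders, and 8050 is the default tserver web UI port:

import urllib.request

TSERVER_WEB_UI = "http://tserver-host:8050"  # placeholder host

def dump_scans(label):
    # The /scans dashboard keeps the most recent scans; the size of that
    # buffer is controlled by the tserver's 'scan_history_count' flag.
    with urllib.request.urlopen(TSERVER_WEB_UI + "/scans") as resp:
        body = resp.read().decode("utf-8", errors="replace")
    with open("scans-%s.html" % label, "w") as f:
        f.write(body)

dump_scans("before-query")
# ... run the Spark query ...
dump_scans("after-query")

Comparing the two captures should show which scans ran and the predicates
and column stats for each, without having to catch the page by hand.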

Best,
Hao

On Wed, Feb 13, 2019 at 9:37 AM Faraz Mateen <fmateen@an10.io> wrote:

> By "not noticing any compaction" I meant I did not see any visible change
> in disk space. However, logs show that there were some compaction related
> operations happening during this whole time period. These statements
> appeared multiple times in tserver logs:
>
> W0211 13:44:10.991221 15822 tablet.cc:1679] T
> 00b8818d0713485b83982ac56d9e342a P 7b44fc5229fe43e190d4d6c1e8022988: Can't
> schedule compaction. Clean time has not been advanced past its initial
> value.
> ...
> ...
> I0211 14:36:33.883819 15822 maintenance_manager.cc:302] P
> 7b44fc5229fe43e190d4d6c1e8022988: Scheduling
> MajorDeltaCompactionOp(30c9aaadcb13460fab832bdea1104349): perf
> score=0.106957
> I0211 14:36:33.884233 13179 diskrowset.cc:560] T
> 30c9aaadcb13460fab832bdea1104349 P 7b44fc5229fe43e190d4d6c1e8022988:
> RowSet(3080): Major compacting REDO delta stores (cols: 2 3 4 5 6 7 9 10 11
> 13 14 15 16 20 22 29 31 33 36 38 39 41 42 47 49 51 52 56 57 58 64 67 68 71
> 75 77 78 79 80 81 109 128 137)
>
>
> Does compaction affect scan performance? And if it does, what can I do to
> limit this degradation?
>
>
> On Wed, Feb 13, 2019 at 7:24 PM Faraz Mateen <fmateen@an10.io> wrote:
>
>> Thanks a lot for the help, Hao.
>>
>> Response Inline:
>>
>> You can use the tablet server web UI scans dashboard (/scans) to get a
>>> better understanding of ongoing/past queries. The flag 'scan_history_count'
>>> is used to configure the size of the buffer. From there, you can get
>>> information such as the applied predicates and column stats for the
>>> selected columns.
>>
>>
>> Thanks. I did not know about this.
>>
>> Did you notice any compactions in Kudu between issuing the two queries?
>>> What is your ingest pattern? Are you inserting data in random primary key
>>> order?
>>
>>
>> The table has hash partitioning on an ID column that can have 15 different
>> values, and range partitioning on a datetime column that is split monthly.
>> ID and datetime together form my primary key. The data we ingest is
>> (usually) in increasing order of time, but the order of IDs is random.
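>>
>> As a hedged illustration only (the column names, types, and master address
>> below are placeholders, not our real schema), this is roughly how that
>> layout looks with the Kudu Python client; the key order Hao asked about is
>> set in set_primary_keys():
>>
>> import kudu
>> from kudu.client import Partitioning
>>
>> client = kudu.connect(host='kudu-master', port=7051)  # placeholder master
>>
>> builder = kudu.schema_builder()
>> builder.add_column('id', type_=kudu.int32, nullable=False)
>> builder.add_column('datetime', type_=kudu.unixtime_micros, nullable=False)
>> # Key order matters for scans: (id, datetime) clusters rows by ID first,
>> # while (datetime, id) orders them by time first.
>> builder.set_primary_keys(['id', 'datetime'])
>> schema = builder.build()
>>
>> # 3 hash buckets on id, range partitioned on datetime (the monthly splits
>> # would be added as range partition bounds).
>> partitioning = Partitioning()
>> partitioning.add_hash_partitions(column_names=['id'], num_buckets=3)
>> partitioning.set_range_partition_columns(['datetime'])
>>
>> client.create_table('example_table', schema, partitioning)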
>>
>> However, ingestion into Kudu was stopped while I was performing these
>> queries. I did not notice any compaction either.
>>
>> On Wed, Feb 13, 2019 at 2:15 AM Hao Hao <hao.hao@cloudera.com> wrote:
>>
>>> Hi Faraz,
>>>
>>> Answered inline below.
>>>
>>> Best,
>>> Hao
>>>
>>> On Tue, Feb 12, 2019 at 6:59 AM Faraz Mateen <fmateen@an10.io> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am using Spark to pull data from my single-node test Kudu setup
>>>> and publish it to Kafka. However, my query time is not consistent.
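>>>>
>>>> A minimal sketch of such a Kudu-to-Kafka Spark job (not our actual code;
>>>> the master address, table name, topic, and filter below are placeholders,
>>>> and it assumes the kudu-spark and spark-sql-kafka packages are on the
>>>> classpath) looks roughly like this:
>>>>
>>>> from pyspark.sql import SparkSession
>>>> from pyspark.sql.functions import to_json, struct
>>>>
>>>> spark = SparkSession.builder.appName("kudu-to-kafka").getOrCreate()
>>>>
>>>> # Read the Kudu table through the kudu-spark data source.
>>>> df = (spark.read
>>>>       .format("org.apache.kudu.spark.kudu")
>>>>       .option("kudu.master", "kudu-master:7051")
>>>>       .option("kudu.table", "impala::default.my_table")
>>>>       .load()
>>>>       .filter("datetime >= '2019-01-01' AND datetime < '2019-02-01'"))
>>>>
>>>> # Serialize each row to JSON and publish it to a Kafka topic.
>>>> (df.select(to_json(struct(*df.columns)).alias("value"))
>>>>    .write
>>>>    .format("kafka")
>>>>    .option("kafka.bootstrap.servers", "kafka-broker:9092")
>>>>    .option("topic", "my_topic")
>>>>    .save())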
>>>>
>>>> I am querying a table with around *1.1 million* packets. Initially my
>>>> query was taking *537 seconds to read 51042 records* from Kudu and
>>>> write them to Kafka. This rate was much lower than what I had expected. I
>>>> had around 45 tables with little data in them that was no longer needed.
>>>> I deleted all those tables, restarted the Spark session and attempted the
>>>> same query. Now the query completed in *5.3 seconds*.
>>>>
>>>> I increased the number of rows to be fetched and tried the same query.
>>>> The row count was *118741*, but it took *1861 seconds* to complete. During
>>>> the query, resource utilization of my servers was very low. When I
>>>> attempted the same query again after a couple of hours, it took only
>>>> *16 secs*.
>>>>
>>>> After this I kept increasing the number of rows to be fetched, and the
>>>> query time kept increasing in a linear fashion.
>>>>
>>>> What I want to ask is:
>>>>
>>>>    - How can I debug why the time for these queries is varying so
>>>>    much? I am not able to get anything out of the Kudu logs.
>>>>
>>>> You can use the tablet server web UI scans dashboard (/scans) to get a
>>> better understanding of ongoing/past queries. The flag
>>> 'scan_history_count' is used to configure the size of the buffer. From
>>> there, you can get information such as the applied predicates and column
>>> stats for the selected columns.
>>>
>>>
>>>>
>>>>    - I am running Kudu with default configurations. Are there any
>>>>    tweaks I should perform to boost the performance of my setup?
>>>>
>>> Did you notice any compactions in Kudu between issuing the two queries?
>>> What is your ingest pattern? Are you inserting data in random primary key
>>> order?
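>>>
>>> For example (a hypothetical sketch with the Kudu Python client, using
>>> placeholder table and column names), inserting in random primary key order
>>> while time still increases would look roughly like this:
>>>
>>> import random
>>> from datetime import datetime, timedelta
>>> import kudu
>>>
>>> client = kudu.connect(host='kudu-master', port=7051)  # placeholder master
>>> table = client.table('example_table')                 # placeholder table
>>> session = client.new_session()
>>>
>>> start = datetime(2019, 2, 1)
>>> for i in range(1000):
>>>     # The time component increases monotonically, but the ID component is
>>>     # random, so rows land all over the (id, datetime) key space.
>>>     op = table.new_insert({'id': random.randint(1, 15),
>>>                            'datetime': start + timedelta(seconds=i)})
>>>     session.apply(op)
>>> session.flush()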
>>>
>>>>
>>>>    - Does having a lot of tables cause performance issues?
>>>>
>>>> If you are not hitting resource limits due to writes/scans on the
>>> other tables, they shouldn't affect the performance of your queries.
>>> Just FYI, this is the scaling guide
>>> <https://kudu.apache.org/docs/scaling_guide.html> with respect to
>>> various system resources.
>>>
>>>>
>>>>    - Will having more masters and tservers improve my query time?
>>>>
>>>> The master is not likely to be the bottleneck, as clients communicate
>>> directly with the tablet servers for queries once they know which tserver
>>> to talk to. But separating the master and tserver onto different nodes
>>> might help. This is the scale limitations
>>> <https://kudu.apache.org/docs/known_issues.html#_scale> guide for a rough
>>> estimate of the number of tservers required for a given quantity of data.
>>>
>>> *Environment Details:*
>>>>
>>>>    - Single-node Kudu 1.7 master and tserver. The server has 4 vCPUs
>>>>    and 16 GB RAM.
>>>>    - The table I am querying is hash partitioned on an ID column with 3
>>>>    buckets. It is also range partitioned on datetime with a new partition
>>>>    for each month.
>>>>    - Kafka version 1.1.
>>>>    - Standalone Spark 2.3.0 deployed on a server with 2 vCPUs and 4 GB RAM.
>>>>
>>>> --
>>>> Faraz Mateen
>>>>
>>>
>>
>> --
>> Faraz Mateen
>>
>
>
> --
> Faraz Mateen
>
