kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: first and second run 2x query time difference
Date Wed, 13 Dec 2017 22:07:54 GMT
On Wed, Dec 13, 2017 at 11:30 AM, Boris Tyukin <boris@boristyukin.com>
wrote:

> thanks J-D! we are going to try that and see how it impacts the runtime.
>
> is there any way to load this metadata upfront? a lot of our queries are
> adhoc in nature but they will be hitting the same tables with different
> predicates and join patterns though.
>

You could use Impala to compute all the stats of all the tables after each
Kudu restart. Actually, do try that, restart Kudu then compute stats and
see how fast it scans.


>
> I am curious why this metadata does not survive restarts though. We are
> going to run our benchmarks again and this time restart Kudu and Impala.
>

It's in the tserver memory, it can't survive a restart.


>
> I just ran another query first time which hits 2 large tables and these
> tables have been scanned by the previous query and this time I do not see
> any difference in query time before the first and second time - I guess
> this confirms your statement about " first time ever scanning the table
> since a Kudu restart" and collecting metadata.
>

Maybe, I've been known to be right once or twice a year :)


>
>
> On Wed, Dec 13, 2017 at 11:18 AM, Jean-Daniel Cryans <jdcryans@apache.org>
> wrote:
>
>> Hi Boris,
>>
>> Given that we don't have much data we can use here, I'll have to
>> extrapolate. As an aside though, this is yet another example where we need
>> more Kudu-side metrics in the query profile.
>>
>> So, Kudu lazily loads a bunch of metadata and that can really affect scan
>> times. If this was your first time ever scanning the table since a Kudu
>> restart, it's very possible that that's where that time was spent. There's
>> also the page cache in the OS that might now be populated. You could do
>> something like "sync; echo 3 > /proc/sys/vm/drop_caches" on all the
>> machines and run the query 2 times again, without restarting Kudu, to
>> understand the effect of the page cache itself. There's currently now way
>> to purge the cached metadata in Kudu though.
>>
>> Hope this helps a bit,
>>
>> J-D
>>
>> On Wed, Dec 13, 2017 at 8:07 AM, Boris Tyukin <boris@boristyukin.com>
>> wrote:
>>
>>> Hi guys,
>>>
>>> I am doing some benchmarks with Kudu and Impala/Parquet and hope to
>>> share it soon but there is one thing that bugs me. This is perhaps Impala
>>> question but since I am using Kudu with Impala I am going to try and ask
>>> anyway.
>>>
>>> One of my queries takes 120 seconds to run the very first time. It joins
>>> one large 5B row table with a bunch of smaller tables and then stores
>>> result in Impala/parquet (not Kudu).
>>>
>>> Now if I run it second and third time, it only takes 60 seconds. Can
>>> someone explain why? Is there any settings to decrease this gap?
>>>
>>> I've compared query profiles in CM and the only thing that was very
>>> different is scan against Kudu table (the large one):
>>>
>>> ***************************
>>> first time:
>>> ***************************
>>> KUDU_SCAN_NODE (id=0) (47.68s)
>>> <https://lkmaorabd103.multihosp.net:7183/cmf/impala/queryDetails?queryId=5143f7165be82819%3Ae00a103500000000&serviceName=impala#>
>>>
>>>
>>>
>>>    - BytesRead: *0 B*
>>>    - InactiveTotalTime: *0ns*
>>>    - KuduRemoteScanTokens: *0*
>>>    - NumScannerThreadsStarted: *20*
>>>    - PeakMemoryUsage: *35.8 MiB*
>>>    - RowsRead: *693,502,241*
>>>    - RowsReturned: *693,502,241*
>>>    - RowsReturnedRate: *14643448 per second*
>>>    - ScanRangesComplete: *20*
>>>    - ScannerThreadsInvoluntaryContextSwitches: *1,341*
>>>    - ScannerThreadsTotalWallClockTime: *36.2m*
>>>       - MaterializeTupleTime(*): *47.57s*
>>>       - ScannerThreadsSysTime: *31.42s*
>>>       - ScannerThreadsUserTime: *1.7m*
>>>    - ScannerThreadsVoluntaryContextSwitches: *96,855*
>>>    - TotalKuduScanRoundTrips: *52,308*
>>>    - TotalReadThroughput: *0 B/s*
>>>    - TotalTime: *47.68s*
>>>
>>>
>>> ***************************
>>> second time:
>>> ***************************
>>> KUDU_SCAN_NODE (id=0) (4.28s)
>>> <https://lkmaorabd103.multihosp.net:7183/cmf/impala/queryDetails?queryId=53497a308f860837%3A243772e000000000&serviceName=impala#>
>>>
>>>
>>>
>>>    - BytesRead: *0 B*
>>>    - InactiveTotalTime: *0ns*
>>>    - KuduRemoteScanTokens: *0*
>>>    - NumScannerThreadsStarted: *20*
>>>    - PeakMemoryUsage: *37.9 MiB*
>>>    - RowsRead: *693,502,241*
>>>    - RowsReturned: *693,502,241*
>>>    - RowsReturnedRate: *173481534 per second*
>>>    - ScanRangesComplete: *20*
>>>    - ScannerThreadsInvoluntaryContextSwitches: *1,451*
>>>    - ScannerThreadsTotalWallClockTime: *19.5m*
>>>       - MaterializeTupleTime(*): *4.20s*
>>>       - ScannerThreadsSysTime: *38.22s*
>>>       - ScannerThreadsUserTime: *1.7m*
>>>    - ScannerThreadsVoluntaryContextSwitches: *480,870*
>>>    - TotalKuduScanRoundTrips: *52,142*
>>>    - TotalReadThroughput: *0 B/s*
>>>    - TotalTime: *4.28s*
>>>
>>>
>>>
>>>
>>>
>>
>

Mime
View raw message