kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: scan performance super bad
Date Sun, 13 May 2018 21:54:59 GMT
Can you share the code you are using to create the scanner and call
nextRows()?

Can you also copy-paste the info provided on the web UI of the kudu master
for this table? It will show the schema and partitioning information.

Is it possible that your table includes a lot of deleted rows? i.e did you
load the table, then delete all the rows, then load again? This can cause
some performance issues in current versions of Kudu as the scanner needs to
"skip over" the deleted rows before it finds any to return.

Based on your description I would expect this to be doing a simple range
scan for the returned rows, and return in just a few milliseconds. The fact
that it is taking 500ms implies that the server is scanning a lot of
non-matching rows before finding a few that match. You can also check the
metrics exposed by the tablet server's web UI and compare the 'rows
scanned' vs 'rows returned' metrics. Capture the values both before and
after you run the query, and you should see whether 'rows_scanned' is much
larger than 'rows_returned'.
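As a sketch of that before/after comparison, assuming metric snapshots with roughly the shape of the tablet server's JSON metrics output (the exact entity structure here is an assumption for illustration; the counter names follow the 'rows scanned'/'rows returned' metrics mentioned above):

```python
import json

# Hypothetical before/after snapshots of per-tablet metrics; the structure
# (list of entities, each with a "metrics" list) is assumed for illustration.
before = json.loads('[{"metrics": [{"name": "rows_scanned", "value": 1000000},'
                    ' {"name": "rows_returned", "value": 5000}]}]')
after = json.loads('[{"metrics": [{"name": "rows_scanned", "value": 9000000},'
                    ' {"name": "rows_returned", "value": 13000}]}]')

def counter(snapshot, name):
    """Sum a named counter across all entities in one snapshot."""
    return sum(m["value"]
               for entity in snapshot
               for m in entity["metrics"]
               if m["name"] == name)

# Delta between the two snapshots isolates the cost of the query just run.
scanned = counter(after, "rows_scanned") - counter(before, "rows_scanned")
returned = counter(after, "rows_returned") - counter(before, "rows_returned")
print(f"rows scanned: {scanned}, rows returned: {returned}")
# A large scanned/returned ratio means the server is skipping many
# non-matching (or deleted) rows per row it actually returns.
```

In this made-up example the query scanned 8,000,000 rows to return 8,000, a 1000x ratio, which would point at exactly the skip-over problem described above.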


On Sun, May 13, 2018 at 12:56 AM, 一米阳光 <710339587@qq.com> wrote:

> hi, i have faced a difficult problem when using kudu 1.6.
> my kudu table schema is generally like this:
> column name: key, type: string, prefix encoding, lz4 compression, primary key
> column name: value, type: string, lz4 compression
> the primary key is built from several parts, e.g.:
> 001320_201803220420_00000001
> the first part is a unique id,
> the second part is a time-formatted string,
> the third part is an incremental integer (for a given unique id and a fixed
> time there may be multiple values, so i use this part to distinguish them)
> the table is range-partitioned on the first part, with splits like below:
> range < 005000
> 005000 <= range < 010000
> 010000 <= range < 015000
> 015000 <= range < 020000
> .....
> 995000 <= range
> when i want to scan data for one unique id and a range of time, the lower
> bound is like 001320_201803220420_00000001 and the upper bound is like
> 001320_201803230420_99999999. it takes about 500ms per call to
> kuduScanner.nextRows(), and each call returns only 20~50 rows. the total
> number of rows between the bounds is about 8000, so i have to call
> nextRows() hundreds of times to fetch all the data, which finally takes
> several minutes. i don't know why this happens or how to resolve it....
> maybe the final solution is to give up kudu and use hbase instead...
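The scan described above is a pair of string bounds on the composite key. Since the unique id is both the key prefix and the range-partition column, every key in the window sorts into one contiguous range inside a single tablet, which is why the expectation of a fast range scan is reasonable. A minimal sketch of how those bound strings sort, with the key format and field widths taken from the example above (the helper name is hypothetical):

```python
# Composite key format from the example: <unique_id>_<timestamp>_<sequence>,
# with a zero-padded 6-digit id and 8-digit sequence so string order matches
# numeric order.
def make_key(uid, ts, seq):
    return f"{uid:06d}_{ts}_{seq:08d}"

lower = make_key(1320, "201803220420", 1)         # 001320_201803220420_00000001
upper = make_key(1320, "201803230420", 99999999)  # 001320_201803230420_99999999

# A key for the same id inside the time window sorts between the bounds;
# a key for a different id falls outside (and in a different partition range).
key_in = make_key(1320, "201803221130", 42)
key_out = make_key(1321, "201803220500", 1)

print(lower <= key_in < upper)   # True: same id, inside the time window
print(lower <= key_out < upper)  # False: different unique id
```

Because the bounds share the id prefix, the scan touches exactly one range partition; if it were slow despite that, the scanned-vs-returned metric check above would be the next thing to look at.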

Todd Lipcon
Software Engineer, Cloudera
