kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Franco Venturi <fvent...@comcast.net>
Subject Efficient way of computing max(PK) in Kudu
Date Thu, 14 Dec 2017 02:54:55 GMT

Scenario: Kudu table with a primary key (PK) made of just one column of type INT64 (i.e. Java

Problem: Compute max(primary key) in an efficient way 

At work we have a Kudu table with billions of rows; its primary key is made of just one column
of type INT64, this column contains a sequence number increasing from 1 to N (number of rows).

For this discussion I am going to assume that the design of this table is fixed and can't
be changed. 

I'd like to find an efficient way to compute the maximum of the primary key, so I know which
sequence number I can start from when inserting another batch of rows in this table. 

Notice that the dual problem, i.e. computing min(PK), can be solved trivially in O(1) time,
due to the fact that in Kudu all the rows within a tablet are kept in primary key sorted order
(see here: https://www.cloudera.com/documentation/enterprise/latest/topics/kudu_schema_design.html#concept_f4d_jsy_1z)
- a simple way to do that is to get a list of KuduScanTokens from a KuduScanTokenBuilder (with
setProjectedColumnIndexes set to return just the 0th column, and limit set to 1), read the
first row returned on each tablet in the list, and compute the minimum of these values. 

In the case of finding the maximum value instead, the simplest approach would be to run a
full table scan (using a KuduScanner or a set of KuduScanTokens in parallel), and find the
maximum among all the values (or the maximum among the last values returned by each tablet).
This approach hoewever scales as O(N) and therefore takes a while to run when the table has
several billion rows (of course with setProjectedColumnIndexes set to return just the primary
key column). 

I also read the API documentation and the code to see if Kudu offered a way to scan the rows
of a table backwards (i.e. in decreasing order), but I couldn't find it (but I would be glad
to be proven wrong on this one). 

After some thinking I came up with this algorithm that uses the lowerBound() method in KuduScannerBuilder
and bisection: given an interval of possible values where the primary key maximum could be,
I create a KuduScannerBuilder with a lowerBound set to a value that is half way between the
two ends of the interval; if that KuduScanner returns at least 1 row (which can be checked
in O(1) time), then the maximum value must be somewhere between the half way point and the
upper end; on the other hand if that KuduScanner returns no rows, then the maximum value must
be in the lower half. 
I then repeat the same process of bisection to the half interval selected above and so on
by using the standard bisection algorithm. As you can easily see, this algorithm is about
I also added a couple of additional tricks to my code: 
- since it is very unlikely that my maximum is in the range of the trillions of quadrillions
(since it is just a sequential number, i.e. it is the same as the number of rows I have),
I run a first lowerBound()+bisection loop to determine the highest '1' bit in the maximum
(i.e. I start with 1<<32, see if there's any row above that lower bound, if there's
none, trying again with 1<<16, and so on) 
- since I imagine that creating KuduScanners (and closing them afterwards) is an "expensive"
operation compared to just scanning a few rows, when the bisection interval reaches some predefined
value (in my tests 128 rows), I switch to just a regular scan of this final interval and find
the maximum among all the values found in this small interval in the same way as the standard
approach described above. 

In conclusion the lowerBound()+bisection algorithm is definitely efficient (and a few tests
I ran showed that), but it seems very complicated (more than it should perhaps), so I was
wondering if I am missing something obvious, and if any of you had to solve a similar problem
in the past, how did you do it? 

Also I haven't looked at the source code for Impala, but I would like to know if Impala uses
any trick (or undocumented rpc call/parameter) when it comes to computing something like this
or scanning the rows of a tablet backwards. 


View raw message