kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Todor Petrov <Todor.Pet...@rms.com>
Subject Kudu read - performance issue
Date Fri, 11 May 2018 19:05:48 GMT
Hi there,

I have an interesting performance issue reading from Kudu. Hopefully there is a good explanation
for it because the difference in the performance is quite significant and it puzzles me a
lot.

Basically we have a table with the following schema:

Column1, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION
Column2, int32 NOT NULL, BIT_SHUFFLE, NO_COMPRESSION
.... (a bunch of int32 and int16 columns)

PK is (Column1, Column2)
HASH(Column1) PARTITIONS 4

The number of records is ~60M. ~5K distinct Column1 values. ~1.4M distinct values for Column2.

All tests are made on one core. I think the hardware specs are not important.


1)      If we query all data using



      val scanner =

        kuduClient.getAsyncScannerBuilder(table)

          .addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema, ComparisonOp.EQUAL,
column1Value)).build()



We use 3 scanners in parallel (one query for each unique value of column1).



All fields from the returned rows are read and some internal structures are built.



In this case, it takes ~40 sec to load all the data.



2)      If we query using "InListPredicate", then the performance is super slow.



      val scanner =

        kuduClient.getAsyncScannerBuilder(table)

          .addPredicate(KuduPredicate.newComparisonPredicate(Column1Schema, ComparisonOp.EQUAL,
column1Value))

          .addPredicate(KuduPredicate.newInListPredicate(Column2Schema, column2Values.asJava)).build()



Same as in 1), 3 scanners in parallel, all records are read and some in-memory structures
are built. This time column2 values are split into a bunch of chunks and we send a request
for each unique value of column1 and each chunk of column2 values.



a)      If we chunk column2Values into buckets of 50K ids, it takes ~2900 sec to load everything.

b)      If we put all 1.4M records in one request (column2Values), it's even slower than a)

Can you please explain what InList is doing that might change the performance so dramatically.
As you see in 2b, the number of queries is the same as in 1. That's really a lot of records,
but given these experiments, it seems better to read by column1 only and then make an in-memory
filter by column2. I have always imagined, that the Kudu driver or server can do that much
more efficiently.

Thanks in advance.

Regards,
Todor

Mime
View raw message