Hi Geetika. I don't know anything about the TPC-H data, but when people load data and see fewer rows than expected, it's usually because of duplicated primary keys. Kudu, unlike Parquet, enforces a unique primary key, so rows that repeat an existing key are not stored as additional rows. What's the schema for the Kudu table?
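
One way to check for this is to look for repeated key values in the Parquet source. The queries below are only a sketch: they assume the Kudu table's primary key is (l_orderkey, l_linenumber), which is the usual TPC-H LINEITEM key but is a guess until you post the schema, so substitute whatever columns your Kudu table actually declares as its key.

```sql
-- Assumption: the Kudu primary key is (l_orderkey, l_linenumber).
-- Replace these with the real key columns of your Kudu LINEITEM table.

-- 1. List a sample of key values that occur more than once in the source.
SELECT l_orderkey, l_linenumber, COUNT(*) AS copies
FROM PARQUETIMPALA500.LINEITEM
GROUP BY l_orderkey, l_linenumber
HAVING COUNT(*) > 1
LIMIT 20;

-- 2. Count the distinct key values in the source.
SELECT COUNT(*) AS distinct_keys
FROM (
  SELECT DISTINCT l_orderkey, l_linenumber
  FROM PARQUETIMPALA500.LINEITEM
) t;
```

If the first query returns rows, or the second count matches the row count you see in Kudu, duplicate keys are the likely explanation.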

Also, it might be useful to know which Kudu and Impala versions you are using.


We ran the command below to load data into Kudu, but the table we loaded ended up with fewer rows than the source:

insert into LINEITEM select * from PARQUETIMPALA500.LINEITEM

The query succeeded, but when we ran count(*) on both tables, the row counts were different:

We are loading 500 GB of TPC-H data into Kudu from a Parquet table.

Geetika Gupta