kudu-user mailing list archives

From William Berkeley <wdberke...@cloudera.com>
Subject Re: Difference in count(*) result for KUDU and parquet
Date Thu, 10 May 2018 05:59:31 GMT
Hi Geetika. I don't know anything about TPC-H data, but when people load
data and see fewer rows, it's usually because of duplicated primary keys.
Kudu, unlike Parquet, enforces a uniqueness constraint on the primary key.
What's the schema for the Kudu table?
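
One way to test the duplicate-key theory is to compare the total row count of
the source against its number of distinct key values. Below is a minimal
sketch, not a definitive check. It assumes the Kudu primary key is
(l_orderkey, l_linenumber), which is only a guess until the schema is known.
It uses an in-memory SQLite table named lineitem_src with made-up rows as a
stand-in for the Parquet source, so that it runs anywhere.

```python
import sqlite3

# Stand-in for the Parquet source table. The table name, the rows, and the
# choice of (l_orderkey, l_linenumber) as the key are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineitem_src ("
    "l_orderkey INTEGER, l_linenumber INTEGER, l_quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO lineitem_src VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 2, 5), (1, 1, 7), (2, 1, 3)],  # key (1, 1) appears twice
)

# Key values that occur more than once in the source.
DUPLICATE_KEYS_SQL = """
SELECT l_orderkey, l_linenumber, COUNT(*) AS copies
FROM lineitem_src
GROUP BY l_orderkey, l_linenumber
HAVING COUNT(*) > 1
"""

# Number of distinct key values, i.e. the most rows a table with a unique
# key on these columns could hold after loading this source.
DISTINCT_KEYS_SQL = """
SELECT COUNT(*)
FROM (SELECT DISTINCT l_orderkey, l_linenumber FROM lineitem_src) AS k
"""

duplicates = conn.execute(DUPLICATE_KEYS_SQL).fetchall()
total_rows = conn.execute("SELECT COUNT(*) FROM lineitem_src").fetchone()[0]
distinct_keys = conn.execute(DISTINCT_KEYS_SQL).fetchone()[0]

print("duplicated keys:", duplicates)
print("total rows:", total_rows, "distinct keys:", distinct_keys)
```

The two SQL statements are plain SQL, so they should carry over to Impala
with lineitem_src replaced by parquetimpala500.lineitem and the key columns
adjusted to match the real Kudu schema. If the distinct-key count matches the
Kudu row count, duplicate keys would account for the difference.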

It would also be useful to know which Kudu and Impala versions you are
using.

-Will

On Wed, May 9, 2018 at 10:03 PM, Geetika Gupta <geetika.gupta@knoldus.in>
wrote:

> Hi community,
>
> We ran the command below to load data into Kudu, but the target table
> ended up with fewer rows than the source:
>
> insert into LINEITEM select * from PARQUETIMPALA500.LINEITEM
>
> The query succeeded, but when we ran count(*) on both tables, the row
> counts differed:
>
> 0: jdbc:hive2://slave2:21050/default> select count(*) from lineitem
> . . . . . . . . . . . . . . . . . . > ;
> 536870912
>
> 0: jdbc:hive2://slave2:21050/default> select count(*) from
> parquetimpala500.lineitem;
> 3000028242
>
> We are loading 500 GB of TPC-H data into Kudu from a Parquet table.
>
> --
> Regards,
> Geetika Gupta
>
