kudu-user mailing list archives

From Dan Burkert <danburk...@apache.org>
Subject Re: Difference in count(*) result for KUDU and parquet
Date Thu, 10 May 2018 16:56:51 GMT
Hi Geetika, this is a known issue
<https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_jdbc.html#jdbc_kudu>
in the Impala JDBC driver.  For further questions about that JDBC driver
I'd direct you to Cloudera's forums, since it's not an Apache or Kudu
component.

- Dan

On Thu, May 10, 2018 at 2:53 AM, Geetika Gupta <geetika.gupta@knoldus.in>
wrote:

> Thanks, William
>
> The problem was caused by duplicated primary keys, so changing the schema
> of the table resolved our issue.
> However, according to the documentation, inserting a row with the same
> primary key values as an existing row should result in a duplicate key
> error.
> No error related to primary key duplication was thrown, and the query
> executed successfully.
>
>
>
> On Thu, May 10, 2018 at 11:29 AM, William Berkeley <
> wdberkeley@cloudera.com> wrote:
>
>> Hi Geetika. While I don't know anything about TPC-H data, when people load
>> data and see fewer rows it's usually because of duplicated primary keys.
>> Kudu, unlike Parquet, has a unique key constraint. What's the schema for
>> the Kudu table?
>>
>> It would also be useful to know which Kudu version and Impala version you
>> are using.
>>
>> -Will
>>
>> On Wed, May 9, 2018 at 10:03 PM, Geetika Gupta <geetika.gupta@knoldus.in>
>> wrote:
>>
>>> Hi community,
>>>
>>> We loaded data into Kudu with the command below, but the table we loaded
>>> into ended up with fewer rows than the source table:
>>>
>>> insert into LINEITEM select * from PARQUETIMPALA500.LINEITEM
>>>
>>> This query was successful, but when we ran count(*) on both tables, the
>>> row counts were different:
>>>
>>> 0: jdbc:hive2://slave2:21050/default> select count(*) from lineitem;
>>> 536870912
>>>
>>> 0: jdbc:hive2://slave2:21050/default> select count(*) from parquetimpala500.lineitem;
>>> 3000028242
>>>
>>> We are loading 500 GB of TPC-H data into Kudu from a Parquet table.
>>>
>>> --
>>> Regards,
>>> Geetika Gupta
>>>
>>
>>
>
>
> --
> Regards,
> Geetika Gupta
>
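
The diagnosis in this thread (rows with duplicated primary keys being dropped
on load) can be reproduced on a small scale. The sketch below is only an
illustration: it uses Python's built-in sqlite3 module as a stand-in for
Impala and Kudu, and INSERT OR IGNORE to mimic a load that skips duplicate
keys without failing. The thread does not show the Kudu table schema, so the
column names (l_orderkey, l_linenumber, l_comment) and the choice of
l_orderkey alone as the key are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Source table with no key constraint (stand-in for the Parquet table).
cur.execute(
    "CREATE TABLE src (l_orderkey INTEGER, l_linenumber INTEGER, l_comment TEXT)"
)
# Target table with a primary key (stand-in for the Kudu table).
# Assumption: the key is l_orderkey alone, which is too narrow for this data.
cur.execute(
    "CREATE TABLE dst (l_orderkey INTEGER PRIMARY KEY, "
    "l_linenumber INTEGER, l_comment TEXT)"
)

rows = [(1, 1, "a"), (1, 2, "b"), (2, 1, "c"), (2, 2, "d"), (3, 1, "e")]
cur.executemany("INSERT INTO src VALUES (?, ?, ?)", rows)

# Rows whose key already exists in dst are skipped and no error is raised.
cur.execute("INSERT OR IGNORE INTO dst SELECT * FROM src")

src_count = cur.execute("SELECT count(*) FROM src").fetchone()[0]
dst_count = cur.execute("SELECT count(*) FROM dst").fetchone()[0]

# Diagnostic 1: how many distinct key values does the source contain?
distinct_keys = cur.execute(
    "SELECT count(*) FROM (SELECT DISTINCT l_orderkey FROM src)"
).fetchone()[0]

# Diagnostic 2: which key values occur more than once in the source?
duplicated = cur.execute(
    "SELECT l_orderkey, count(*) FROM src "
    "GROUP BY l_orderkey HAVING count(*) > 1 ORDER BY l_orderkey"
).fetchall()

print(src_count, dst_count, distinct_keys, duplicated)
```

If the number of distinct key values in the source matches the row count of
the target, the missing rows are explained by key collisions, and widening
the primary key (as was done by changing the schema in this thread) is the
likely fix. The two diagnostic queries are plain SQL and should carry over to
the real source table once the actual key columns are substituted.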
