kudu-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Todd Lipcon <t...@cloudera.com>
Subject Re: kudu vs parquet
Date Fri, 23 Dec 2016 01:20:51 GMT
Hi Rotem,

It looks like the dataset you used has some string columns with high length
and low cardinality. By default, Parquet does dictionary encoding of all
strings, whereas Kudu currently does not. So I'm guessing that when you run
these queries, Parquet is able to scan only a MB or less, whereas Kudu is
scanning tens or hundreds of MBs (the raw data).

Can you try recreating the table using dictionary encoding for the STRING
columns? If you are using the latest "impala-kudu" build (
http://archive.cloudera.com/beta/impala-kudu/parcels/latest/) it should
support these options:

   CREATE TABLE tbl_name ([col_name type [PRIMARY KEY] [option [...]]] [,
....])
    where option is:
    | NULL
    | NOT NULL
    | ENCODING encoding_val
    | COMPRESSION compression_algorithm
    | DEFAULT expr
    | BLOCK_SIZE num

encoding_val in this case should be 'DICT_ENCODING'.

I'm about to go on vacation for the winter holidays, but if you can share
the source of this public dataset I can also try to reproduce and
investigate why the performance seems to differ. Differences of this
magnitude are not expected.

-Todd

On Fri, Dec 23, 2016 at 7:03 AM, Rotem Gabay <rotemgabay82@gmail.com> wrote:

> Hi,
>
> I have run some performence tests on small scale cluster ( 2 data nodes ,
> m4.xlarge aws ec2 machines) .
> In order to compare parquet stored table to kudu table  , I have created
> the data from  general public police records.
> I found out that apart from direct PK access, parquet have outperformed
> kudu (on some case on large scale).
> I am using KUDU version 1.1.0 .
> IS this an expected behaviour ?
> Are there any performance guidelines which I should have implemented in
> order to balance the outcomes ?
> (rule of thumb regarding number of buckets ?)
>
> I would realy appreciate shedding some light.
>
> Regards,
> Rotem
>
>
>
>
> [ip-XXX-XX-XX-XX.eu-west-1:21000] > create table
> hash_range_crimes2_parquet  STORED AS PARQUET as select * from
> hash_range_crimes2_raw;
>
>
> [ip-XXX-XX-XX-XX.eu-west-1:21000] >  CREATE TABLE hash_range_crimes2_kudu
> (
>                                                    >     Crime_ID STRING,
>                                                    > Month STRING,
>                                                    > Reported_by STRING,
>                                                    > Falls_within STRING,
>                                                    > Longitude STRING,
>                                                    > Latitude STRING,
>                                                    > `Location` STRING,
>                                                    > LSOA_code STRING,
>                                                    > LSOA_name STRING,
>                                                    > Crime_type STRING,
>                                                    > Last_outcome_category
> STRING,
>                                                    > Context STRING
>                                                    > )
>                                                    > DISTRIBUTE BY HASH
> (crime_id) INTO 8 BUCKETS
>                                                    > TBLPROPERTIES(
>                                                    >   'storage_handler' =
> 'com.cloudera.kudu.hive.KuduStorageHandler',
>                                                    >   'kudu.table_name' =
> 'hash_range_crimes2_kudu',
>                                                    >
> 'kudu.master_addresses' = ' ip-XXX-XX-XX-78.eu-west-1:7051',
>                                                    >   'kudu.key_columns'
> = 'crime_id,month',
>                                                    >
>  'kudu.num_tablet_replicas' = '2'
>                                                    > );
>
>
>
>
> [ip-XXX-XX-XX-XX.eu-west-1:21000] > insert into hash_range_crimes2_kudu
> select * from hash_range_crimes2_raw;
>
>
> [ip-XXX-XX-XX-XX.eu-west-1:21000] > select count(*) from
> hash_range_crimes2_parquet;
> Query: select count(*) from hash_range_crimes2_parquet
> Query submitted at: 2016-12-12 09:39:23 (Coordinator:
> http://ip-XXX-XX-XX-XX.eu-west-1:25000)
> Query progress can be monitored at: http://ip-XXX-XX-XX-XX.eu-
> west-1:25000/query_plan?query_id=6b42a353e6ee0c45:78d0fc7c00000000
> +-----------+
> | count(*)  |
> +-----------+
> | 143582880 |
> +-----------+
> Fetched 1 row(s) in 0.65s
> [ip-XXX-XX-XX-XX.eu-west-1:21000] > select count(*) from
> hash_range_crimes2_kudu;
> Query: select count(*) from hash_range_crimes2_kudu
> Query submitted at: 2016-12-12 09:39:27 (Coordinator:
> http://ip-XXX-XX-XX-XX.eu-west-1:25000)
> Query progress can be monitored at: http://ip-XXX-XX-XX-XX.eu-
> west-1:25000/query_plan?query_id=ea48faeb3d2ab999:8be8a73000000000
> +-----------+
> | count(*)  |
> +-----------+
> | 143582880 |
> +-----------+
> Fetched 1 row(s) in 0.87s
>
>
>
>
> [ip-XXX-XX-XX-XX.eu-west-1:21000] > select crime_id,count(*) from
> hash_range_crimes2_parquet group by crime_id having count(*) > 1;
> Fetched 168704 row(s) in 86.99s
>
> [ip-XXX-XX-XX-XX.eu-west-1:21000] > select crime_id,count(*) from
> hash_range_crimes2_kudu group by crime_id having count(*) > 1;
> Fetched 168704 row(s) in 93.71s
>
>
> [ip-XXX-XX-XX-XX.eu-west-1:21000] > select * from
> hash_range_crimes2_parquet where crime_id  = '
> 2489eb2d3157ecc219583d6c307f943c37713c28030e69853312941faede4756234';
> Query: select * from hash_range_crimes2_parquet where crime_id  = '
> 2489eb2d3157ecc219583d6c307f943c37713c28030e69853312941faede4756234'
> Query submitted at: 2016-12-12 09:59:28 (Coordinator:
> http://ip-XXX-XX-XX-XX.eu-west-1:25000)
> Query progress can be monitored at: http://ip-XXX-XX-XX-XX.eu-
> west-1:25000/query_plan?query_id=164da1054f84cc04:6d9babd200000000
> +-----------------------------------------------------------
> ----------+---------+------------------------------------+--
> ----------------------------------+-----------+-----------+-
> -------------------------------+-----------+-----------+----
> --------------------------+-----------------------+---------+
> | crime_id                                                            |
> month   | reported_by                        | falls_within
>       | longitude | latitude  | location                       | lsoa_code
> | lsoa_name | crime_type                   | last_outcome_category |
> context |
> +-----------------------------------------------------------
> ----------+---------+------------------------------------+--
> ----------------------------------+-----------+-----------+-
> -------------------------------+-----------+-----------+----
> --------------------------+-----------------------+---------+
> | 2489eb2d3157ecc219583d6c307f943c37713c28030e69853312941faede4756234 |
> 2015-02 | Police Service of Northern Ireland | Police Service of Northern
> Ireland | -5.924925 | 54.599348 | On or near Upper Church Lane   |
>   |           | Violence and sexual offences |                       |
>     |
> | 2489eb2d3157ecc219583d6c307f943c37713c28030e69853312941faede4756234 |
> 2016-01 | Police Service of Northern Ireland | Police Service of Northern
> Ireland | -5.970360 | 54.627343 | On or near Silverstream Parade |
>   |           | Violence and sexual offences |                       |
>     |
> +-----------------------------------------------------------
> ----------+---------+------------------------------------+--
> ----------------------------------+-----------+-----------+-
> -------------------------------+-----------+-----------+----
> --------------------------+-----------------------+---------+
> Fetched 2 row(s) in 73.36s
>
> [ip-XXX-XX-XX-XX.eu-west-1:21000] > select * from hash_range_crimes2_kudu
> where crime_id  = '2489eb2d3157ecc219583d6c307f94
> 3c37713c28030e69853312941faede4756234';
>
> Query: select * from hash_range_crimes2_kudu where crime_id  = '
> 2489eb2d3157ecc219583d6c307f943c37713c28030e69853312941faede4756234'
> Query submitted at: 2016-12-12 10:01:13 (Coordinator:
> http://ip-XXX-XX-XX-XX.eu-west-1:25000)
> Query progress can be monitored at: http://ip-XXX-XX-XX-XX.eu-
> west-1:25000/query_plan?query_id=f34f0ca8192ef5c1:30beca2700000000
> +-----------------------------------------------------------
> ----------+---------+------------------------------------+--
> ----------------------------------+-----------+-----------+-
> -------------------------------+-----------+-----------+----
> --------------------------+-----------------------+---------+
> | crime_id                                                            |
> month   | reported_by                        | falls_within
>       | longitude | latitude  | location                       | lsoa_code
> | lsoa_name | crime_type                   | last_outcome_category |
> context |
> +-----------------------------------------------------------
> ----------+---------+------------------------------------+--
> ----------------------------------+-----------+-----------+-
> -------------------------------+-----------+-----------+----
> --------------------------+-----------------------+---------+
> | 2489eb2d3157ecc219583d6c307f943c37713c28030e69853312941faede4756234 |
> 2015-02 | Police Service of Northern Ireland | Police Service of Northern
> Ireland | -5.924925 | 54.599348 | On or near Upper Church Lane   |
>   |           | Violence and sexual offences |                       |
>     |
> | 2489eb2d3157ecc219583d6c307f943c37713c28030e69853312941faede4756234 |
> 2016-01 | Police Service of Northern Ireland | Police Service of Northern
> Ireland | -5.970360 | 54.627343 | On or near Silverstream Parade |
>   |           | Violence and sexual offences |                       |
>     |
> +-----------------------------------------------------------
> ----------+---------+------------------------------------+--
> ----------------------------------+-----------+-----------+-
> -------------------------------+-----------+-----------+----
> --------------------------+-----------------------+---------+
> Fetched 2 row(s) in 1.49s
>
>
>  [ip-XXX-XX-XX-XX.eu-west-1:21000] >select month,count(*) from
> hash_range_crimes2_parquet group by month having count(*) > 1;
>  Fetched 29 row(s) in 3.03s
>
>
>   [ip-XXX-XX-XX-XX.eu-west-1:21000] > select month,count(*) from
> hash_range_crimes2_kudu group by month having count(*) > 1;
> Fetched 29 row(s) in 17.10s
>
>
>  [ip-XXX-XX-XX-XX.eu-west-1:21000] > select `location`,count(*) from
> hash_range_crimes2_parquet group by `location` ;
> Fetched 251776 row(s) in 29.16s
>
>  [ip-XXX-XX-XX-XX.eu-west-1:21000] > select `location`,count(*) from
> hash_range_crimes2_kudu group by `location` ;
> Fetched 251765 row(s) in 65.67s
>
>
>  [ip-XXX-XX-XX-XX.eu-west-1:21000] > select reported_by,count(*) from
> hash_range_crimes2_parquet group by reported_by having count(*) > 1000000 ;
> Fetched 41 row(s) in 4.84s
>
>  [ip-XXX-XX-XX-XX.eu-west-1:21000] > select reported_by,count(*) from
> hash_range_crimes2_kudu group by reported_by having count(*) > 1000000 ;
> Fetched 41 row(s) in 45.14s
>
>  [ip-XXX-XX-XX-XX.eu-west-1:21000] >SELECT month, COUNT(*) from
> hash_range_crimes2_parquet WHERE reported_by = 'Northamptonshire Police'
> GROUP BY month;
> Fetched 28 row(s) in 1.83s
>
>  [ip-XXX-XX-XX-XX.eu-west-1:21000] > SELECT month, COUNT(*)  from
> hash_range_crimes2_kudu WHERE reported_by = 'Northamptonshire Police' GROUP
> BY month;
> Fetched 28 row(s) in 24.10s
>
>  [ip-XXX-XX-XX-XX.eu-west-1:21000] >SELECT  COUNT(*) from
> hash_range_crimes2_parquet WHERE context is not null;
> Fetched 1 row(s) in 1.26s
>
>  [ip-XXX-XX-XX-XX.eu-west-1:21000] > SELECT  COUNT(*) from
> hash_range_crimes2_kudu WHERE context is not null;
> Fetched 1 row(s) in 19.12s
>
>  [ip-XXX-XX-XX-XX.eu-west-1:21000] >SELECT month, COUNT(DISTINCT
> crime_id) from hash_range_crimes2_parquet WHERE reported_by =
> 'Northamptonshire Police' AND crime_type='Violence and sexual offences'
> GROUP BY month;
>
> Fetched 28 row(s) in 48.63s
>
>  [ip-XXX-XX-XX-XX.eu-west-1:21000] >SELECT month, COUNT(DISTINCT
> crime_id) from hash_range_crimes2_kudu WHERE reported_by =
> 'Northamptonshire Police' AND crime_type='Violence and sexual offences'
> GROUP BY month;
>
> Fetched 28 row(s) in 84.18s
>
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Mime
View raw message