kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Apache Kudu Table is 6.6 times larger than Parquet File.
Date Mon, 13 Mar 2017 06:30:09 GMT
Hi Jason,

The first thing that jumps out to me is that you aren't using dictionary
encoding on your string columns. I would recommend using DICT_ENCODING for
all string fields and BIT_SHUFFLE for all int/double/float fields. If you
have any string fields which are not repetitive (high cardinality) then I
would also recommend enabling LZ4 compression on them (Parquet uses lz4 by
default on all strings).
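
As a rough sketch of what these settings could look like in Impala DDL (not
taken from your actual schema): the table name `kudu_table_v2`, the numeric
column `n`, and the choice of `d` as the non-repetitive column are all
assumptions made for illustration.

```sql
-- Hypothetical re-creation of the table with explicit encodings.
-- kudu_table_v2, column n, and treating d as high cardinality are assumptions.
CREATE TABLE kudu_table_v2 (
  a STRING NOT NULL ENCODING DICT_ENCODING,
  b STRING NOT NULL ENCODING DICT_ENCODING,
  c STRING NULL ENCODING DICT_ENCODING,
  -- assumed non-repetitive string: dictionary encoding plus LZ4 compression
  d STRING NULL ENCODING DICT_ENCODING COMPRESSION LZ4,
  -- hypothetical numeric column, shown only to illustrate BIT_SHUFFLE
  n BIGINT NULL ENCODING BIT_SHUFFLE,
  PRIMARY KEY (a, b)
)
PARTITION BY HASH (a) PARTITIONS 40
STORED AS KUDU;
```

The data would then need to be reloaded into the new table before comparing
sizes, since these attributes are set per column when the table is created.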

That should get you close to Parquet sizes (and those are the new defaults
in the upcoming 1.3 release). If you still see a 6x blowup after making
these changes please report back.

-Todd

On Fri, Mar 10, 2017 at 7:16 PM, Jason Heo <jason.heo.sde@gmail.com> wrote:

> Hello, I'm new to Apache Kudu. I was really impressed by the concept of
> Kudu and benchmark results. I'm considering using (Impala + Kudu) on my
> team project.
>
> One issue I have is that the Kudu table is much larger than the Parquet
> files:
>
> - Parquet File: 1.3TB
> - Kudu Table: 8.6TB
>
> (both tables are configured with a replication factor of 3)
>
> I'm using Kudu with CDH 5.10 and most of the configuration is unchanged
> (I've only changed `memory_limit_hard_bytes` and `block_cache_capacity_mb`
> to prevent bulk load errors).
>
> When I changed `ENCODING` for some fields, the table size only decreased by
> 5%. I assume there are optimization techniques to reduce Kudu table size.
>
> I would really appreciate any advice.
>
> Thanks in advance.
>
> `parquet_table` has 38 STRING fields and 6B rows.
>
> The schema of `parquet_table` looks like this:
>
>     ```
>     > SHOW CREATE TABLE parquet_table;
>     CREATE EXTERNAL TABLE default.parquet_table (
>       a STRING,
>       b STRING,
>       c STRING,
>       d STRING,
>       ...
>       ...
>     )
>     PARTITIONED BY (
>       ymd STRING
>     )
>     WITH SERDEPROPERTIES ('serialization.format'='1')
>     STORED AS PARQUET
>     LOCATION 'hdfs://hostname/path/to/parquet'
>     ```
>
> I've created `kudu_table` and bulk loaded it using `INSERT INTO kudu_table
> SELECT * FROM parquet_table`
>
>     ```
>     > SHOW CREATE TABLE kudu_table;
>     CREATE TABLE default.kudu_table (
>       a STRING NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
>       b STRING NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
>       c STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
>       d STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,
>       ...
>       PRIMARY KEY (a, b)
>     )
>     PARTITION BY HASH (a) PARTITIONS 40
>     STORED AS KUDU
>     TBLPROPERTIES ('kudu.master_addresses'='host1,host2',
>                    'kudu.table_name'='impala::kudu_table')
>     ```
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
