phoenix-dev mailing list archives

From "James Taylor (JIRA)" <>
Subject [jira] [Commented] (PHOENIX-3559) More disk space used with encoded column scheme with data in sparse columns
Date Wed, 04 Jan 2017 01:07:58 GMT


James Taylor commented on PHOENIX-3559:

The encoding scheme isn't optimized for sparse storage; the idea is to use it when your
storage is dense. You could potentially use the column encoding scheme but still use multiple
key values, which would be a good choice for sparse data. You'd want to use realistic column
names for a test like this (instead of c1, c2, c3), as that's where you'd get some space savings.
It'd be good to determine where the break-even point is in terms of sparseness.
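As a sketch of the combination suggested above (encoded column qualifiers, but one key value per column), assuming the 4.10-era table properties COLUMN_ENCODED_BYTES and IMMUTABLE_STORAGE_SCHEME; the table and column names here are illustrative, not from the reported test:

{noformat}
-- Hypothetical example: keep one KeyValue per column (friendlier to
-- sparse data, since null columns simply have no cell) while still
-- using compact numeric column qualifiers instead of the column names.
CREATE TABLE metrics (
    host VARCHAR NOT NULL,
    ts DATE NOT NULL,
    cpu_usage DECIMAL,
    disk_reads BIGINT
    CONSTRAINT pk PRIMARY KEY (host, ts))
    COLUMN_ENCODED_BYTES = 2,
    IMMUTABLE_STORAGE_SCHEME = ONE_CELL_PER_COLUMN;
{noformat}

With ONE_CELL_PER_COLUMN, the savings come only from the shorter encoded qualifiers, which is why realistic (long) column names matter for the comparison.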

We could potentially improve our new storage format for sparse storage, but I'm not sure we'll
find a single format that's optimal for both dense and sparse storage. Enabling new storage
formats to be defined will be valuable for this reason.

> More disk space used with encoded column scheme with data in sparse columns
> ---------------------------------------------------------------------------
>                 Key: PHOENIX-3559
>                 URL:
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: Mujtaba Chohan
>            Assignee: Samarth Jain
>             Fix For: 4.10.0
> Schema with 5K columns
> {noformat}
> create table (k1 integer, k2 integer, c1 varchar ... c5000 varchar CONSTRAINT pk PRIMARY KEY (k1, k2))
> {noformat}
> In this schema, only 100 random columns are filled with random 15 chars. Rest are nulls.
> Data size is *6X* larger with the encoded column scheme compared to non-encoded: ~12GB/1M rows encoded vs ~2GB/1M rows non-encoded.
> When compressed with GZ, the size with the encoded column scheme is still 35% higher.

This message was sent by Atlassian JIRA
