carbondata-issues mailing list archives

From "Jihong MA (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CARBONDATA-431) Improve compression ratio for numeric datatype
Date Fri, 16 Dec 2016 01:38:58 GMT

     [ https://issues.apache.org/jira/browse/CARBONDATA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jihong MA updated CARBONDATA-431:
---------------------------------
    Description: 
Carbon has a better compression ratio for the String type, but the worst for numeric data types.
Identify the issues with the current numeric datatype compression so that Carbon achieves a
better compression ratio.

DataType                  | Text | Parquet | Orc | Carbon
decimal                   | 16G  | 11G     | 6G  | 13G
int                       | 5G   | 1G      | 1G  | 3G
String (high cardinality) | 24G  | 22G     | 11G | 3G (no dictionary)
String (low cardinality)  | 30G  | 4G      | 4G  | 1G (dictionary encode); 1G (dictionary encode without inverted index); 3G (no dictionary encode)


  was:
Carbon's string type has a better compression ratio, but for numeric types ORC has the best
compression. We should analyze the numeric datatypes in Carbon to get a better compression
ratio.

DataType                  | Text | Parquet | Orc | Carbon
decimal                   | 16G  | 11G     | 6G  | 13G
int                       | 5G   | 1G      | 1G  | 3G
String (high cardinality) | 24G  | 22G     | 11G | 3G (no dictionary)
String (low cardinality)  | 30G  | 4G      | 4G  | 1G (dictionary encode); 1G (dictionary encode without inverted index); 3G (no dictionary encode)


        Summary: Improve compression ratio for numeric datatype  (was: Analysis compression for numeric datatype compared with Parquet/ORC)

> Improve compression ratio for numeric datatype 
> -----------------------------------------------
>
>                 Key: CARBONDATA-431
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-431
>             Project: CarbonData
>          Issue Type: Sub-task
>            Reporter: suo tong
>            Assignee: Ashok Kumar
>             Fix For: 1.0.0-incubating
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Carbon has a better compression ratio for the String type, but the worst for numeric data types.
> Identify the issues with the current numeric datatype compression so that Carbon achieves a
> better compression ratio.
>
> DataType                  | Text | Parquet | Orc | Carbon
> decimal                   | 16G  | 11G     | 6G  | 13G
> int                       | 5G   | 1G      | 1G  | 3G
> String (high cardinality) | 24G  | 22G     | 11G | 3G (no dictionary)
> String (low cardinality)  | 30G  | 4G      | 4G  | 1G (dictionary encode); 1G (dictionary encode without inverted index); 3G (no dictionary encode)
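
To make the gap on numeric types concrete, the sizes reported above can be turned into compression ratios. This is only a rough sketch: it assumes "compression ratio" means the Text size divided by the size in each format (the issue does not define the term), and the names `SIZES_GB` and `compression_ratios` are hypothetical helpers, not part of CarbonData.

```python
# Sizes in GB for the two numeric rows, transcribed from the table above.
SIZES_GB = {
    "decimal": {"Text": 16, "Parquet": 11, "Orc": 6, "Carbon": 13},
    "int": {"Text": 5, "Parquet": 1, "Orc": 1, "Carbon": 3},
}


def compression_ratios(sizes):
    """Return Text size divided by each format's size (higher is better).

    Assumption: the ratio is measured against the raw Text size.
    """
    text = sizes["Text"]
    return {fmt: text / size for fmt, size in sizes.items() if fmt != "Text"}


for dtype, sizes in SIZES_GB.items():
    ratios = compression_ratios(sizes)
    summary = ", ".join(f"{fmt} {ratio:.2f}x" for fmt, ratio in ratios.items())
    print(f"{dtype}: {summary}")

# Prints:
# decimal: Parquet 1.45x, Orc 2.67x, Carbon 1.23x
# int: Parquet 5.00x, Orc 5.00x, Carbon 1.67x
```

Under that assumption, Carbon trails both Parquet and ORC on decimal and int, which matches the motivation stated in the description.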



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
