carbondata-issues mailing list archives

From "Jacky Li (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CARBONDATA-431) Analysis compression for numeric datatype compared with Parquet/ORC
Date Tue, 13 Dec 2016 07:36:58 GMT

     [ https://issues.apache.org/jira/browse/CARBONDATA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacky Li updated CARBONDATA-431:
--------------------------------
    Fix Version/s: 1.0.0-incubating

> Analysis compression for numeric datatype compared with Parquet/ORC
> -------------------------------------------------------------------
>
>                 Key: CARBONDATA-431
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-431
>             Project: CarbonData
>          Issue Type: Sub-task
>            Reporter: suo tong
>            Assignee: Ashok Kumar
>             Fix For: 1.0.0-incubating
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Carbon's string type has a better compression ratio than the other formats, but for
> numeric types ORC has the best compression. We should analyze the numeric data types in
> Carbon to get a better compression ratio.
> DataType                  | Text | Parquet | ORC | Carbon
> decimal                   | 16G  | 11G     | 6G  | 13G
> int                       | 5G   | 1G      | 1G  | 3G
> String (high cardinality) | 24G  | 22G     | 11G | 3G (no dictionary)
> String (low cardinality)  | 30G  | 4G      | 4G  | 1G (dictionary encode)
>                           |      |         |     | 1G (dictionary encode without inverted index)
>                           |      |         |     | 3G (no dictionary encode)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
