carbondata-issues mailing list archives

From jackylk <...@git.apache.org>
Subject [GitHub] carbondata pull request #1265: [CARBONDATA-1128] Add direct string encoding ...
Date Fri, 18 Aug 2017 06:16:33 GMT
GitHub user jackylk opened a pull request:

    https://github.com/apache/carbondata/pull/1265

    [CARBONDATA-1128] Add direct string encoding for short string column

    For short string columns (values shorter than 128 bytes), add a new encoding to improve compression and loading speed.
    DirectStringCodec encodes the input column into two arrays:
    1. one for string content, stored in the data page of DataChunk2
    2. another for string lengths, stored in EncoderMeta in DataChunk2
    The two arrays are compressed separately by the compressor.
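
A minimal sketch of the two-array idea described above, written in Python for brevity. It is not the DirectStringCodec code from this pull request. The function names are hypothetical, `zlib` merely stands in for whichever compressor the table uses, and storing each length in a single byte is an assumption drawn from the "shorter than 128 bytes" limit.

```python
import zlib


def encode_direct_string(values, max_len=127):
    """Split a column of short strings into a content array and a length array.

    Hypothetical helper: max_len=127 assumes every length fits in one byte.
    """
    content = bytearray()
    lengths = bytearray()
    for value in values:
        raw = value.encode("utf-8")
        if len(raw) > max_len:
            raise ValueError("value too long for direct string encoding")
        content += raw            # string bytes, concatenated back to back
        lengths.append(len(raw))  # one length entry per row
    # The two arrays are compressed separately (zlib is only a stand-in).
    return zlib.compress(bytes(content)), zlib.compress(bytes(lengths))


def decode_direct_string(content_z, lengths_z):
    """Rebuild the column by slicing the content array using the lengths."""
    content = zlib.decompress(content_z)
    lengths = zlib.decompress(lengths_z)
    values, pos = [], 0
    for n in lengths:
        values.append(content[pos:pos + n].decode("utf-8"))
        pos += n
    return values


column = ["AIR", "TRUCK", "", "REG AIR"]
content_z, lengths_z = encode_direct_string(column)
assert decode_direct_string(content_z, lengths_z) == column
```

Keeping the lengths apart from the content means each array holds one kind of data, which is presumably what lets the compressor handle them separately, as the description states.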
    
    I tested with TPC-H generated data (1 GB):
    1.  For high-cardinality columns (L_COMMENT in the LINEITEM table, 4580667 distinct values)
    CREATE TABLE LINEITEM (
        L_COMMENT VARCHAR(44)
    )
    STORED BY 'carbondata'
    
    - Use direct string encoding
    loading time: 12496 ms
    size: 42M
    
    - Use existing encoding
    loading time: 12230 ms
    size: 45M
    
    2.  For low-cardinality columns (L_SHIPMODE in the LINEITEM table, 7 distinct values)
    CREATE TABLE LINEITEM (
        L_SHIPMODE CHAR(10)
    )
    STORED BY 'carbondata'
    
    - Use direct string encoding
    loading time: 6089 ms
    size: 1.4M
    
    - Use existing encoding
    loading time: 6556 ms
    size: 1.7M
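
For reference, the relative differences implied by the four measurements above can be worked out as in the sketch below. The numbers are copied from the results above; the helper name `pct_change` is illustrative only, and the sizes are assumed to be in megabytes.

```python
def pct_change(new, old):
    """Percentage change of `new` relative to `old`; negative means smaller or faster."""
    return round((new - old) / old * 100, 1)


# (direct string encoding, existing encoding) for each metric
results = {
    "L_COMMENT": {"load_ms": (12496, 12230), "size_mb": (42, 45)},
    "L_SHIPMODE": {"load_ms": (6089, 6556), "size_mb": (1.4, 1.7)},
}

for column, metrics in results.items():
    for metric, (direct, existing) in metrics.items():
        print(column, metric, pct_change(direct, existing), "%")
```

On this data set the new encoding produces a smaller size for both columns, while loading time is slightly higher for L_COMMENT and lower for L_SHIPMODE.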

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jackylk/incubator-carbondata direct_string

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1265.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1265
    
----
commit 828f108fc5c312a087f80e4470cae7666293fc3f
Author: Jacky Li <jacky.likun@qq.com>
Date:   2017-08-17T01:57:43Z

    add integral rle codec

commit 37d1c0977220eab37ba4695e0585e0045258e702
Author: Jacky Li <jacky.likun@qq.com>
Date:   2017-08-17T13:32:02Z

    decode by meta

commit 05eaa0b6ef64ba23425259006e3c57fd867f95c4
Author: Jacky Li <jacky.likun@qq.com>
Date:   2017-08-18T06:05:38Z

    add direct string codec for short string column

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
