carbondata-issues mailing list archives

From "Pallavi Singh (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CARBONDATA-726) Update with V3 format for better IO and processing optimization.
Date Tue, 30 May 2017 12:25:04 GMT

    [ https://issues.apache.org/jira/browse/CARBONDATA-726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029343#comment-16029343 ]

Pallavi Singh edited comment on CARBONDATA-726 at 5/30/17 12:24 PM:
--------------------------------------------------------------------

There was a documentation impact for this issue. It was raised and fixed in CARBONDATA-1084
(https://issues.apache.org/jira/browse/CARBONDATA-1084).


was (Author: pallavisingh_09):
There was a documentation impact for this issue. It was raised and fixed in JIRA issue 1084.

> Update with V3 format for better IO and processing optimization.
> ----------------------------------------------------------------
>
>                 Key: CARBONDATA-726
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-726
>             Project: CarbonData
>          Issue Type: Improvement
>            Reporter: Ravindra Pesala
>             Fix For: 1.1.0
>
>          Time Spent: 10h 10m
>  Remaining Estimate: 0h
>
> Problems in the current format:
> 1. IO read is slow because multiple seeks on the file are needed to read the column
blocklets. The current blocklet size is 120,000 rows, so the file must be read multiple times
to scan the data of one column. Alternatively, the blocklet size could be increased, but then
filter queries suffer because each filter has to process a bigger blocklet.
> 2. Decompression is slow in the current format. We use an inverted index for faster
filter queries and compress it with NumberCompressor using bit-wise packing, which is slow,
so the number compressor should be avoided. One alternative is to keep the blocklet
size within 32,000 rows so the inverted index can be written with short values, but then IO
read suffers a lot.
> To overcome the above two issues, we are introducing the new V3 format:
> Each blocklet holds multiple pages of 32,000 rows each; the number of pages per blocklet
is configurable. Since a page stays within the short limit, there is no need to compress the
inverted index.
> The max/min values of each page are maintained to further prune filter queries.
> A blocklet is read with all its pages at once and kept in off-heap memory.
> During filtering, the max/min range is checked first; only if it is valid is the page
decompressed for further filtering.
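The page-level min/max pruning described in the quoted issue can be sketched roughly as below. This is a minimal illustration, not CarbonData's actual implementation; the `Page` class, field names, and `prune` method are hypothetical stand-ins for the per-page metadata the V3 format keeps.

```java
import java.util.ArrayList;
import java.util.List;

public class PagePruningSketch {
    // Hypothetical page metadata: min/max of an int column plus the
    // still-compressed page bytes (decompressed only when needed).
    static class Page {
        final int min;
        final int max;
        final byte[] compressedData;
        Page(int min, int max, byte[] data) {
            this.min = min;
            this.max = max;
            this.compressedData = data;
        }
    }

    // Return only the pages whose [min, max] range can contain the filter
    // value; only these candidates need to be decompressed and scanned.
    static List<Page> prune(List<Page> pages, int filterValue) {
        List<Page> candidates = new ArrayList<>();
        for (Page p : pages) {
            if (filterValue >= p.min && filterValue <= p.max) {
                candidates.add(p);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<Page> pages = new ArrayList<>();
        pages.add(new Page(1, 100, new byte[0]));
        pages.add(new Page(101, 200, new byte[0]));
        pages.add(new Page(201, 300, new byte[0]));
        // An equality filter on 150 only touches the second page;
        // the other two are skipped without decompression.
        System.out.println(prune(pages, 150).size()); // prints 1
    }
}
```

The point of the design is that the min/max check is a cheap comparison on already-loaded metadata, so the expensive decompression step runs only on pages that can actually match.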



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
