carbondata-issues mailing list archives

From "Jacky Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CARBONDATA-458) Improving carbon first time query performance
Date Thu, 01 Dec 2016 09:53:58 GMT

    [ https://issues.apache.org/jira/browse/CARBONDATA-458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15711514#comment-15711514 ]

Jacky Li commented on CARBONDATA-458:
-------------------------------------

This work is not about merging footers into a central file; it is about reorganizing the
internal structure of the carbon file to make first-time queries faster. I think
the biggest bottlenecks are the 3rd and 4th of the points Vishal has listed:

3. Carbon reads more footer data than is required (data chunk).
4. Many random seeks happen in carbon because column data (data page, rle, inverted
index) is not stored together.
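
To illustrate point 4, here is a minimal sketch that counts seeks when the three parts of a column are scattered across the file versus stored back to back. It is not CarbonData's actual file format or reader: the two layouts, the part names, and all function names below are assumptions made up for the illustration.

```python
import io

# Hypothetical part names, taken from the wording of point 4.
PARTS = ("data_page", "rle", "inverted_index")


def build_scattered(columns):
    """Assumed layout A: all data pages first, then all rle parts, then all inverted indexes."""
    buf = io.BytesIO()
    index = {name: {} for name in columns}
    for part in PARTS:
        for name, parts in columns.items():
            index[name][part] = (buf.tell(), len(parts[part]))
            buf.write(parts[part])
    return buf.getvalue(), index


def build_colocated(columns):
    """Assumed layout B: the three parts of each column are written back to back."""
    buf = io.BytesIO()
    index = {}
    for name, parts in columns.items():
        start = buf.tell()
        for part in PARTS:
            buf.write(parts[part])
        index[name] = (start, buf.tell() - start)
    return buf.getvalue(), index


def read_ranges(blob, ranges):
    """Read (offset, length) ranges in order; count a seek whenever a range
    does not start where the previous read ended."""
    f = io.BytesIO(blob)
    seeks = 0
    pos = None
    out = []
    for offset, length in ranges:
        if offset != pos:
            f.seek(offset)
            seeks += 1
        out.append(f.read(length))
        pos = offset + length
    return b"".join(out), seeks


columns = {
    "c1": {"data_page": b"A" * 8, "rle": b"B" * 4, "inverted_index": b"C" * 4},
    "c2": {"data_page": b"D" * 8, "rle": b"E" * 4, "inverted_index": b"F" * 4},
}

blob_a, index_a = build_scattered(columns)
blob_b, index_b = build_colocated(columns)

data_a, seeks_a = read_ranges(blob_a, [index_a["c2"][p] for p in PARTS])
data_b, seeks_b = read_ranges(blob_b, [index_b["c2"]])
print(seeks_a, seeks_b)
```

In this toy setup both reads return the same bytes for column c2, but the scattered layout needs one seek per part while the co-located layout needs a single seek.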

>  Improving carbon first time query performance
> ----------------------------------------------
>
>                 Key: CARBONDATA-458
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-458
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: core, data-load, data-query
>            Reporter: kumar vishal
>            Assignee: kumar vishal
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Improving carbon first time query performance
> Reason:
> 1. Because the file system cache is cleared, reading the file and caching it is slower.
> 2. In a first-time query, carbon has to read the footer from the data file to build
> the btree.
> 3. Carbon reads more footer data than is required (data chunk).
> 4. Many random seeks happen in carbon because column data (data page, rle,
> inverted index) is not stored together.
> Solution:
> 1. Improve block loading time. This can be done by removing the data chunk from blockletInfo
> and storing only the offset and length of the data chunk.
> 2. Compress the presence meta bitset stored for null values of measure columns using snappy.
> 3. Store the metadata and data of a column together and read them together; this reduces
> random seeks and improves IO.
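
Solutions 1 and 2 in the quoted description can be sketched roughly as follows. This is only an illustration under stated assumptions, not CarbonData code: the footer layout (`ENTRY`, a trailing entry count) and every function name are hypothetical, and `zlib` from the Python standard library stands in for snappy so that the example has no third-party dependency.

```python
import struct
import zlib

# Hypothetical footer entry: 8-byte offset + 4-byte length, little-endian.
ENTRY = struct.Struct("<QI")


def pack_null_bitset(null_rows, row_count):
    """Build a one-bit-per-row null bitset and compress it.
    zlib is a stand-in here; the issue proposes snappy."""
    bits = bytearray((row_count + 7) // 8)
    for row in null_rows:
        bits[row // 8] |= 1 << (row % 8)
    return zlib.compress(bytes(bits))


def unpack_null_rows(compressed, row_count):
    """Decompress the bitset and return the row ids marked as null."""
    bits = zlib.decompress(compressed)
    return [r for r in range(row_count) if bits[r // 8] & (1 << (r % 8))]


def write_block(chunks):
    """Write chunks back to back, then a footer holding only (offset, length)
    per chunk, then a 4-byte entry count."""
    body = bytearray()
    footer = bytearray()
    for chunk in chunks:
        footer += ENTRY.pack(len(body), len(chunk))
        body += chunk
    return bytes(body + footer + struct.pack("<I", len(chunks)))


def read_footer(blob):
    """Read only the footer: a list of (offset, length) pairs, no chunk data."""
    (count,) = struct.unpack_from("<I", blob, len(blob) - 4)
    start = len(blob) - 4 - count * ENTRY.size
    return [ENTRY.unpack_from(blob, start + i * ENTRY.size) for i in range(count)]


def read_chunk(blob, entry):
    """Fetch one chunk on demand using its footer entry."""
    offset, length = entry
    return blob[offset:offset + length]


row_count = 1000
null_bitset = pack_null_bitset([3, 900], row_count)
blob = write_block([b"x" * 10, null_bitset])

footer = read_footer(blob)
null_rows = unpack_null_rows(read_chunk(blob, footer[1]), row_count)
print(footer, null_rows)
```

The point of the sketch is that loading the footer touches only fixed-size entries, and a chunk (here the compressed null bitset) is read and decompressed only when a query actually needs it.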



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
