carbondata-issues mailing list archives

From "xuchuanyin (JIRA)" <>
Subject [jira] [Created] (CARBONDATA-2023) Optimization data loading for skewed data
Date Fri, 12 Jan 2018 07:13:00 GMT
xuchuanyin created CARBONDATA-2023:

             Summary: Optimization data loading for skewed data
                 Key: CARBONDATA-2023
             Project: CarbonData
          Issue Type: Improvement
          Components: data-load
    Affects Versions: 1.3.0
            Reporter: xuchuanyin
            Assignee: xuchuanyin

In one of my use cases, CarbonData has to load skewed data files: file sizes range
from 1KB to about 5GB.

In the current implementation, CarbonData distributes the file blocks (splits) among the nodes
to maximize data locality and to spread the data evenly. We call this `block-node-assignment`
for short.

However, the current implementation has some problems.

The assignment is based on block count: the goal is for every node to process the
same number of blocks. In the skewed data scenario described above, a block from a
small file and a block from a big file differ greatly in size (1KB vs. 64MB). As
a result, the total data size assigned to each node can differ greatly.
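
To make the imbalance concrete, here is a minimal sketch of a count-based assignment. It is not CarbonData's actual code: the function name `assign_by_count`, the node names, and the block list are hypothetical, and data locality is ignored entirely.

```python
from collections import defaultdict

KB = 1024
MB = 1024 * KB


def assign_by_count(block_sizes, nodes):
    """Round-robin assignment: each node receives (almost) the same number of blocks."""
    assignment = defaultdict(list)
    for i, size in enumerate(block_sizes):
        assignment[nodes[i % len(nodes)]].append(size)
    return dict(assignment)


# Hypothetical input: 64MB blocks from a big file interleaved with 1KB small files.
blocks = [64 * MB, 1 * KB, 64 * MB, 1 * KB, 64 * MB, 1 * KB, 64 * MB, 1 * KB]
by_count = assign_by_count(blocks, ["node1", "node2"])

# Total bytes each node has to load.
count_totals = {node: sum(sizes) for node, sizes in by_count.items()}
for node, total in count_totals.items():
    print(node, len(by_count[node]), "blocks,", total, "bytes")
```

Both nodes receive four blocks, yet in this contrived ordering one node loads 256MB while the other loads only 4KB.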

To solve this problem, block size should be taken into account during block-node-assignment.
One node may process more blocks than another as long as the total size of the blocks
assigned to each node is roughly the same.
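
One possible way to do this is a greedy "largest block first, least-loaded node next" strategy. The sketch below only illustrates the idea under stated assumptions: `assign_by_size`, the node names, and the block sizes are hypothetical, and data locality, which a real block-node-assignment would also have to respect, is left out.

```python
import heapq

KB = 1024
MB = 1024 * KB


def assign_by_size(block_sizes, nodes):
    """Greedy size-aware assignment: give the largest remaining block
    to the node with the smallest total assigned size so far."""
    heap = [(0, node) for node in nodes]  # (total bytes assigned, node name)
    heapq.heapify(heap)
    assignment = {node: [] for node in nodes}
    for size in sorted(block_sizes, reverse=True):
        total, node = heapq.heappop(heap)
        assignment[node].append(size)
        heapq.heappush(heap, (total + size, node))
    return assignment


# Hypothetical input: one full block, two half-size blocks, two tiny files.
blocks = [64 * MB, 32 * MB, 32 * MB, 1 * KB, 1 * KB]
by_size = assign_by_size(blocks, ["node1", "node2"])

# Total bytes each node has to load.
size_totals = {node: sum(sizes) for node, sizes in by_size.items()}
for node, total in size_totals.items():
    print(node, len(by_size[node]), "blocks,", total, "bytes")
```

Here node1 ends up with two blocks and node2 with three, but both load the same total number of bytes, which is the property this issue asks for.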

