From: "xuchuanyin (JIRA)"
To: issues@carbondata.apache.org
Date: Fri, 12 Jan 2018 07:28:00 +0000 (UTC)
Subject: [jira] [Commented] (CARBONDATA-2023) Optimization in data loading for skewed data

[ https://issues.apache.org/jira/browse/CARBONDATA-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16323646#comment-16323646 ]

xuchuanyin commented on CARBONDATA-2023:
----------------------------------------

As for this feature, there are two options:
1. Add an option for skewed data loading and let the user enable the feature if needed.
2. Use an adaptive strategy and let carbondata decide on the fly whether to enable the feature.

I prefer the second option, but will implement the first option first.

> Optimization in data loading for skewed data
> --------------------------------------------
>
> Key: CARBONDATA-2023
> URL: https://issues.apache.org/jira/browse/CARBONDATA-2023
> Project: CarbonData
> Issue Type: Improvement
> Components: data-load
> Affects Versions: 1.3.0
> Reporter: xuchuanyin
> Assignee: xuchuanyin
>
> In one of my cases, carbondata has to load skewed data files. The data files range in size from 1KB to about 5GB.
> In the current implementation, carbondata distributes the file blocks (splits) among the nodes to maximize data locality and to keep the data evenly distributed; we call this `block-node-assignment` for short.
> However, the current implementation has some problems.
> The assignment is based on the number of blocks. The goal is to make sure that all the nodes handle the same number of blocks. In the skewed data scenario described above, a block of a small file and a block of a big file differ greatly in size (1KB vs. 64MB). As a result, the total data size assigned to each data node varies widely.
> To solve this problem, the size of each block should be considered during the block-node-assignment. One node can handle more blocks than another as long as the total size of the blocks on each node is almost the same.
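
To make the difference concrete, here is a minimal sketch comparing count-based and size-based assignment. It is not the carbondata implementation: the function names `assign_by_count` and `assign_by_size` are hypothetical, the count-based variant is modeled as plain round-robin, the size-based variant is a simple greedy heuristic (largest block first, to the node with the least data so far), and data locality is ignored entirely.

```python
import heapq

KB, MB = 1024, 1024 * 1024


def assign_by_count(block_sizes, num_nodes):
    """Round-robin: every node gets (almost) the same number of blocks."""
    nodes = [[] for _ in range(num_nodes)]
    for i, size in enumerate(block_sizes):
        nodes[i % num_nodes].append(size)
    return nodes


def assign_by_size(block_sizes, num_nodes):
    """Greedy: hand the next-largest block to the node holding the least data."""
    # Heap entries are (total bytes assigned so far, node index).
    heap = [(0, n) for n in range(num_nodes)]
    heapq.heapify(heap)
    nodes = [[] for _ in range(num_nodes)]
    for size in sorted(block_sizes, reverse=True):
        total, n = heapq.heappop(heap)
        nodes[n].append(size)
        heapq.heappush(heap, (total + size, n))
    return nodes


# Hypothetical skewed input: blocks of big files mixed with 1KB files.
blocks = [64 * MB, 1 * KB, 32 * MB, 1 * KB, 32 * MB]

by_count = assign_by_count(blocks, 2)
by_size = assign_by_size(blocks, 2)

for name, plan in (("count-based", by_count), ("size-based", by_size)):
    for n, assigned in enumerate(plan):
        print(name, "node", n, "blocks:", len(assigned), "bytes:", sum(assigned))
```

With this input, the count-based plan gives one node 128MB and the other 2KB, while the size-based plan gives one node 2 blocks and the other 3 blocks, each totaling 64MB plus 1KB. A real implementation would also have to weigh data locality, which this sketch leaves out.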