From: chenliang613@apache.org
To: commits@carbondata.incubator.apache.org
Date: Tue, 25 Apr 2017 07:48:25 -0000
Message-Id: <6f1c65f18730462dada385ba8970fbfc@git.apache.org>
X-Mailer: ASF-Git Admin Mailer
Subject: [1/2] incubator-carbondata git commit: docs-for-optimizing-mass-data-loading

Repository: incubator-carbondata
Updated Branches:
  refs/heads/master 060b455a8 -> d695a8af4


docs-for-optimizing-mass-data-loading


Project: http://git-wip-us.apache.org/repos/asf/incubator-carbondata/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-carbondata/commit/7f8fd8e8
Tree: http://git-wip-us.apache.org/repos/asf/incubator-carbondata/tree/7f8fd8e8
Diff: http://git-wip-us.apache.org/repos/asf/incubator-carbondata/diff/7f8fd8e8

Branch: refs/heads/master
Commit: 7f8fd8e84cc2c39d355ef3101bda73f03711a851
Parents: 060b455
Author: WilliamZhu
Authored: Tue Apr 25 10:11:46 2017 +0800
Committer: WilliamZhu
Committed: Tue Apr 25 10:17:58 2017 +0800

----------------------------------------------------------------------
 docs/useful-tips-on-carbondata.md | 85 +++++++++++++++++++++++-----------
 1 file changed, 59 insertions(+), 26 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-carbondata/blob/7f8fd8e8/docs/useful-tips-on-carbondata.md
----------------------------------------------------------------------
diff --git a/docs/useful-tips-on-carbondata.md b/docs/useful-tips-on-carbondata.md
index b1ff903..bfddf29 100644
--- a/docs/useful-tips-on-carbondata.md
+++ b/docs/useful-tips-on-carbondata.md
@@ -22,17 +22,19 @@ This tutorial guides you to create CarbonData Tables and optimize performance.
 The following sections will elaborate on the above topics :
 
 * [Suggestions to create CarbonData Table](#suggestions-to-create-carbondata-table)
-* [Configurations For Optimizing CarbonData Performance](#configurations-for-optimizing-carbondata-performance)
+* [Configuration for Optimizing Data Loading Performance for Massive Data](#configuration-for-optimizing-data-loading-performance-for-massive-data)
+* [Configurations for Optimizing CarbonData Performance](#configurations-for-optimizing-carbondata-performance)
+
 ## Suggestions to Create CarbonData Table
 
 Recently CarbonData was used to analyze performance of Telecommunication field. The results of the analysis for table creation with dimensions ranging from
-10 thousand to 10 billion rows and 100 to 300 columns have been summarized below.
+10 thousand to 10 billion rows and 100 to 300 columns have been summarized below.
 
 The following table describes some of the columns from the table used.
-
-
+
+
 **Table Column Description**
 
 | Column Name | Data Type | Cardinality | Attribution |
@@ -51,29 +53,29 @@ CarbonData has more than 50 test cases, on the basis of these we have following
 
 * **Put the frequently-used column filter in the beginning**
 
-  For example, MSISDN filter is used in most of the query then we must put the MSISDN in the first column.
+  For example, if the MSISDN filter is used in most of the queries, then MSISDN must be put as the first column.
 The create table command can be modified as suggested below :
 
 ```
   create table carbondata_table(
   msisdn String,
   ...
-  )STORED BY 'org.apache.carbondata.format'
+  )STORED BY 'org.apache.carbondata.format'
   TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,..',
   'DICTIONARY_INCLUDE'='...');
 ```
-
+
   Now the query with MSISDN in the filter will be more efficient.
 
 * **Put the frequently-used columns in the order of low to high cardinality**
-
+
   If the table in the specified query has multiple columns which are frequently used to filter the results, it is suggested to put
-the columns in the order of cardinality low to high. This ordering of frequently used columns improves the compression ratio and
+the columns in the order of cardinality low to high. This ordering of frequently used columns improves the compression ratio and
   enhances the performance of queries with filter on these columns.
-
-For example if MSISDN, HOST and Dime_1 are frequently-used columns, then the column order of table is suggested as
-Dime_1>HOST>MSISDN as Dime_1 has the lowest cardinality.
+
+For example, if MSISDN, HOST and Dime_1 are frequently-used columns, then the column order of the table is suggested as
+Dime_1>HOST>MSISDN, as Dime_1 has the lowest cardinality.
 The create table command can be modified as suggested below :
 
 ```
@@ -82,7 +84,7 @@ The create table command can be modified as suggested below :
   HOST String,
   MSISDN String,
   ...
-  )STORED BY 'org.apache.carbondata.format'
+  )STORED BY 'org.apache.carbondata.format'
   TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST..',
   'DICTIONARY_INCLUDE'='Dime_1..');
 ```
@@ -100,7 +102,7 @@ The create table command can be modified as below :
   HOST String,
   MSISDN String,
   ...
-  )STORED BY 'org.apache.carbondata.format'
+  )STORED BY 'org.apache.carbondata.format'
   TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI..',
   'DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME..');
 ```
@@ -108,7 +110,7 @@ The create table command can be modified as below :
 
 * **For measure type columns with non high accuracy, replace Numeric(20,0) data type with Double data type**
 
-  For columns of measure type, not requiring high accuracy, it is suggested to replace Numeric data type with Double to enhance
+  For columns of measure type not requiring high accuracy, it is suggested to replace the Numeric data type with Double to enhance
 query performance. The create table command can be modified as below :
 
 ```
@@ -121,17 +123,17 @@ query performance. The create table command can be modified as below :
   counter_2 double,
   ...
   counter_100 double
-  )STORED BY 'org.apache.carbondata.format'
+  )STORED BY 'org.apache.carbondata.format'
   TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI',
   'DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');
 ```
 
 The result of performance analysis of test-case shows reduction in query execution time from 15 to 3 seconds, thereby improving performance by nearly 5 times.
-
+
 * **Columns of incremental character should be re-arranged at the end of dimensions**
 
 Consider the following scenario where data is loaded each day and the start_time is incremental for each load, it is
-suggested to put start_time at the end of dimensions.
+suggested to put start_time at the end of dimensions.
 Incremental values are efficient in using min/max index.
 
 The create table command can be modified as below :
@@ -145,25 +147,56 @@ suggested to put start_time at the end of dimensions.
   BEGIN_TIME bigint,
   ...
   counter_100 double
-  )STORED BY 'org.apache.carbondata.format'
+  )STORED BY 'org.apache.carbondata.format'
   TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI',
-  'DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');
+  'DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');
 ```
 
 * **Avoid adding high cardinality columns to dictionary**
 
-  If the system has low memory configuration, then it is suggested to exclude high cardinality columns from the dictionary to
-enhance load performance. Creation of dictionary for high cardinality columns at time of load will degrade load performance due to
-excessive memory usage.
+  If the system has a low memory configuration, then it is suggested to exclude high cardinality columns from the dictionary to
+enhance load performance. Creating a dictionary for high cardinality columns at load time will degrade load performance due to
+excessive memory usage. By default, CarbonData determines the cardinality at the first data load and allows dictionary creation only if the cardinality is less than 1 million.
+
+## Configuration for Optimizing Data Loading Performance for Massive Data
+
+
+  CarbonData supports loading large volumes of data. During this process, sorting the data while loading consumes a lot of memory and disk IO,
+  which can sometimes result in an "Out Of Memory" exception.
+  If you do not have much memory to spare, you may prefer to slow down the speed of data loading rather than risk a data load failure.
+  You can tune the following properties in the carbon.properties file to get better data loading performance:
+
+| Parameter | Default Value | Description/Tuning |
+|-----------|---------------|--------------------|
+| carbon.number.of.cores.while.loading | 2 | Number of cores used for data processing during data loading in CarbonData. The value should be >= 2. |
+| carbon.sort.size | 100000 | Threshold for writing a local file in the sort step when loading data. The value should be >= 100. |
+| carbon.sort.file.write.buffer.size | 50000 | DataOutputStream buffer size. |
+| carbon.number.of.cores.block.sort | 7 | Number of cores used for block sort; increase it if plenty of memory and CPU cores are available. |
+| carbon.merge.sort.reader.thread | 3 | Number of cores used for temp file merging during data loading in CarbonData. |
+| carbon.merge.sort.prefetch | true | Set this value to false if you do not have enough memory. |
+
+
+For example, suppose 10 million records are to be loaded into a CarbonData table on a machine with only 16 cores and 64 GB of memory.
+Using the default configuration, the load always fails in the sort step. Modify carbon.properties as suggested below:
+
+```
+carbon.number.of.cores.block.sort=1
+carbon.merge.sort.reader.thread=1
+carbon.sort.size=5000
+carbon.sort.file.write.buffer.size=5000
+carbon.merge.sort.prefetch=false
+```
+
 ## Configurations for Optimizing CarbonData Performance
 
-Recently we did some performance POC on CarbonData for Finance and telecommunication Field. It involved detailed queries and aggregation
+Recently we did some performance POC on CarbonData for Finance and telecommunication Field. It involved detailed queries and aggregation
 scenarios. After the completion of POC, some of the configurations impacting the performance have been identified and tabulated below :
 
 | Parameter | Location | Used For | Description | Tuning |
@@ -176,5 +209,5 @@ scenarios. After the completion of POC, some of the configurations impacting the
 | carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer size to store records, returned from the block scan. | In limit scenario this parameter is very important. For example your query limit is 1000. But if we set this value to 3000 that means we get 3000 records from scan but spark will only take 1000 rows. So the 2000 remaining are useless. In one Finance test case after we set it to 100, in the limit 1000 scenario the performance increase about 2 times in comparison to if we set this value to 12000. |
 | carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether use YARN local directories for multi-table load disk load balance | If this is set it to true CarbonData will use YARN local directories for multi-table load disk load balance, that will improve the data load performance. |
-
-
\ No newline at end of file
+Note: If your CarbonData instance is used only for queries, you may also set the Spark configuration 'spark.speculation=true'.
\ No newline at end of file
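
As a minimal sketch of how the spark.speculation setting in the closing note is typically applied (the file path and submit command below are illustrative, not taken from the CarbonData docs), it can be enabled either globally in Spark's defaults file or per job on the command line:

```
# In $SPARK_HOME/conf/spark-defaults.conf (path depends on your installation)
spark.speculation  true

# Or per application, when submitting the job
spark-submit --conf spark.speculation=true --class <your.main.Class> your-app.jar
```

Speculation re-launches straggling tasks, which can reduce query latency at the cost of some duplicated work, which is why the note above suggests it only for query-only CarbonData instances.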