carbondata-commits mailing list archives

From ravipes...@apache.org
Subject [1/8] carbondata git commit: Fixed linking and content issues
Date Sat, 24 Jun 2017 06:39:11 GMT
Repository: carbondata
Updated Branches:
  refs/heads/branch-1.1 8ceb069ed -> 6a63c06d1


Fixed linking and content issues


Project: http://git-wip-us.apache.org/repos/asf/carbondata/repo
Commit: http://git-wip-us.apache.org/repos/asf/carbondata/commit/26de8ea7
Tree: http://git-wip-us.apache.org/repos/asf/carbondata/tree/26de8ea7
Diff: http://git-wip-us.apache.org/repos/asf/carbondata/diff/26de8ea7

Branch: refs/heads/branch-1.1
Commit: 26de8ea7efe6f7c06debbaf9eb06fc2125ec2d90
Parents: 8ceb069
Author: jatin <jatin.demla@knoldus.in>
Authored: Thu Jun 15 13:18:48 2017 +0530
Committer: ravipesala <ravi.pesala@gmail.com>
Committed: Sat Jun 24 10:19:00 2017 +0530

----------------------------------------------------------------------
 docs/faq.md                       |  2 +-
 docs/useful-tips-on-carbondata.md | 23 +++++++++++------------
 2 files changed, 12 insertions(+), 13 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/carbondata/blob/26de8ea7/docs/faq.md
----------------------------------------------------------------------
diff --git a/docs/faq.md b/docs/faq.md
index 88db7d5..45fd960 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -80,7 +80,7 @@ The property carbon.lock.type configuration specifies the type of lock to be acq
 In order to build CarbonData project it is necessary to specify the spark profile. The spark profile sets the Spark Version. You need to specify the ``spark version`` while using Maven to build project.
 
 ## How Carbon will behave when execute insert operation in abnormal scenarios?
-Carbon support insert operation, you can refer to the syntax mentioned in [DML Operations on CarbonData](http://carbondata.apache.org/dml-operation-on-carbondata).
+Carbon support insert operation, you can refer to the syntax mentioned in [DML Operations on CarbonData](dml-operation-on-carbondata.md).
 First, create a soucre table in spark-sql and load data into this created table.
 
 ```
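
The FAQ entry touched by this hunk walks through creating a source table in spark-sql, loading data into it, and then inserting into a CarbonData table. A minimal sketch of that flow is shown below for reference; the table and column names are hypothetical and the DDL assumes the usual CarbonData ``STORED BY 'carbondata'`` syntax:

```
-- hypothetical source table created in spark-sql
CREATE TABLE source_table (
  id INT,
  name STRING,
  city STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- load data into the source table from a local CSV file (path is illustrative)
LOAD DATA LOCAL INPATH 'source_table.csv' INTO TABLE source_table;

-- hypothetical CarbonData table with a matching schema
CREATE TABLE carbon_table (
  id INT,
  name STRING,
  city STRING
) STORED BY 'carbondata';

-- the insert operation discussed in the FAQ entry
INSERT INTO TABLE carbon_table SELECT * FROM source_table;
```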

http://git-wip-us.apache.org/repos/asf/carbondata/blob/26de8ea7/docs/useful-tips-on-carbondata.md
----------------------------------------------------------------------
diff --git a/docs/useful-tips-on-carbondata.md b/docs/useful-tips-on-carbondata.md
index 40a3947..06bc12b 100644
--- a/docs/useful-tips-on-carbondata.md
+++ b/docs/useful-tips-on-carbondata.md
@@ -23,7 +23,7 @@ The following sections will elaborate on the above topics :
 
 * [Suggestions to create CarbonData Table](#suggestions-to-create-carbondata-table)
 * [Configuration for Optimizing Data Loading performance for Massive Data](#configuration-for-optimizing-data-loading-performance-for-massive-data)
-* [Optimizing Mass Data Loading](#optimizing-mass-data-loading)
+* [Optimizing Mass Data Loading](#configurations-for-optimizing-carbondata-performance)
 
 
 ## Suggestions to Create CarbonData Table
@@ -98,7 +98,7 @@ The create table command can be modified as below :
 ```
   create table carbondata_table(
   Dime_1 String,
-  BEGIN_TIME bigint
+  BEGIN_TIME bigint,
   HOST String,
   MSISDN String,
   ...
@@ -116,7 +116,7 @@ query performance. The create table command can be modified as below :
 ```
   create table carbondata_table(
   Dime_1 String,
-  BEGIN_TIME bigint
+  BEGIN_TIME bigint,
   HOST String,
   MSISDN String,
   counter_1 double,
@@ -132,8 +132,8 @@ query performance. The create table command can be modified as below :
 
 * **Columns of incremental character should be re-arranged at the end of dimensions**
 
-  Consider the following scenario where data is loaded each day and the start_time is incremental for each load, it is
-suggested to put start_time at the end of dimensions.
+  Consider the following scenario where data is loaded each day and the begin_time is incremental for each load, it is
+suggested to put begin_time at the end of dimensions.
 
   Incremental values are efficient in using min/max index. The create table command can be modified as below :
 
@@ -170,12 +170,12 @@ excessive memory usage.
 CarbonData supports large data load, in this process sorting data while loading consumes a lot of memory and disk IO and
 this can result sometimes in "Out Of Memory" exception.
 If you do not have much memory to use, then you may prefer to slow the speed of data loading instead of data load failure.
- You can configure CarbonData by tuning following properties in carbon.properties file to get a better performance.:
+ You can configure CarbonData by tuning following properties in carbon.properties file to get a better performance.
 
 | Parameter | Default Value | Description/Tuning |
 |-----------|-------------|--------|
 |carbon.number.of.cores.while.loading|Default: 2.This value should be >= 2|Specifies the number of cores used for data processing during data loading in CarbonData. |
-|carbon.sort.size|Default: 100000. The value should be >= 100.|Threshhold to write local file in sort step when loading data|
+|carbon.sort.size|Default: 100000. The value should be >= 100.|Threshold to write local file in sort step when loading data|
 |carbon.sort.file.write.buffer.size|Default:  50000.|DataOutputStream buffer. |
 |carbon.number.of.cores.block.sort|Default: 7 | If you have huge memory and cpus, increase it as you will|
 |carbon.merge.sort.reader.thread|Default: 3 |Specifies the number of cores used for temp file merging during data loading in CarbonData.|
@@ -183,7 +183,7 @@ excessive memory usage.
 
 
 For example, if there are  10 million records ,and i have only 16 cores ,64GB memory, will be loaded to CarbonData table.
-Using the default configuration  always fail in sort step. Modify carbon.properties as suggested below
+Using the default configuration  always fail in sort step. Modify carbon.properties as suggested below:
 
 
 ```
@@ -204,10 +204,9 @@ scenarios. After the completion of POC, some of the configurations impacting the
 | carbon.sort.intermediate.files.limit | spark/carbonlib/carbon.properties | Data loading | During the loading of data, local temp is used to sort the data. This number specifies the minimum number of intermediate files after which the  merge sort has to be initiated. | Increasing the parameter to a higher value will improve the load performance. For example, when we increase the value from 20 to 100, it increases the data load performance from 35MB/S to more than 50MB/S. Higher values of this parameter consumes  more memory during the load. |
 | carbon.number.of.cores.while.loading | spark/carbonlib/carbon.properties | Data loading | Specifies the number of cores used for data processing during data loading in CarbonData. | If you have more number of CPUs, then you can increase the number of CPUs, which will increase the performance. For example if we increase the value from 2 to 4 then the CSV reading performance can increase about 1 times |
 | carbon.compaction.level.threshold | spark/carbonlib/carbon.properties | Data loading and Querying | For minor compaction, specifies the number of segments to be merged in stage 1 and number of compacted segments to be merged in stage 2. | Each CarbonData load will create one segment, if every load is small in size it will generate many small file over a period of time impacting the query performance. Configuring this parameter will merge the small segment to one big segment which will sort the data and improve the performance. For Example in one telecommunication scenario, the performance improves about 2 times after minor compaction. |
-| spark.sql.shuffle.partitions | spark/con/spark-defaults.conf | Querying | The number of task started when spark shuffle. | The value can be 1 to 2 times as much as the executor cores. In an aggregation scenario, reducing the number from 200 to 32 reduced the query time from 17 to 9 seconds. |
-| num-executors/executor-cores/executor-memory | spark/con/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for CarbonData query. | In the bank scenario, we provide the 4 CPUs cores and 15 GB for each executor which can get good performance. This 2 value does not mean more the better. It needs to be configured properly in case of limited resources. For example, In the bank scenario, it has enough CPU 32 cores each node but less memory 64 GB each node. So we cannot give more CPU but less memory. For example, when 4 cores and 12GB for each executor. It sometimes happens GC during the query which impact the query performance very much from the 3 second to more than 15 seconds. In this scenario need to increase the memory or decrease the CPU cores. |
+| spark.sql.shuffle.partitions | spark/conf/spark-defaults.conf | Querying | The number of task started when spark shuffle. | The value can be 1 to 2 times as much as the executor cores. In an aggregation scenario, reducing the number from 200 to 32 reduced the query time from 17 to 9 seconds. |
+| spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for CarbonData query. | In the bank scenario, we provide the 4 CPUs cores and 15 GB for each executor which can get good performance. This 2 value does not mean more the better. It needs to be configured properly in case of limited resources. For example, In the bank scenario, it has enough CPU 32 cores each node but less memory 64 GB each node. So we cannot give more CPU but less memory. For example, when 4 cores and 12GB for each executor. It sometimes happens GC during the query which impact the query performance very much from the 3 second to more than 15 seconds. In this scenario need to increase the memory or decrease the CPU cores. |
 | carbon.detail.batch.size | spark/carbonlib/carbon.properties | Data loading | The buffer size to store records, returned from the block scan. | In limit scenario this parameter is very important. For example your query limit is 1000. But if we set this value to 3000 that means we get 3000 records from scan but spark will only take 1000 rows. So the 2000 remaining are useless. In one Finance test case after we set it to 100, in the limit 1000 scenario the performance increase about 2 times in comparison to if we set this value to 12000. |
 | carbon.use.local.dir | spark/carbonlib/carbon.properties | Data loading | Whether use YARN local directories for multi-table load disk load balance | If this is set it to true CarbonData will use YARN local directories for multi-table load disk load balance, that will improve the data load performance. |
 
-Note: If your CarbonData instance is provided only for query, you may specify the conf 'spark.speculation=true' which is conf
- in spark.
\ No newline at end of file
+Note: If your CarbonData instance is provided only for query, you may specify the property 'spark.speculation=true' which is in conf directory of spark.
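
For quick reference, the loading-side parameters named in the table above can be collected in carbon.properties. The sketch below only restates values already mentioned in the table (the listed defaults, the intermediate-files example of 100, and carbon.use.local.dir enabled); it is illustrative rather than a tuned recommendation:

```
# illustrative values taken from the table above, not a tuned recommendation
carbon.number.of.cores.while.loading=2
carbon.sort.size=100000
carbon.sort.file.write.buffer.size=50000
carbon.number.of.cores.block.sort=7
carbon.merge.sort.reader.thread=3
carbon.sort.intermediate.files.limit=100
carbon.use.local.dir=true
```

Likewise, the querying-side settings from the table and the note can be placed in spark-defaults.conf. The figures below simply repeat the examples given above (32 shuffle partitions, 4 cores and 15 GB per executor, speculation enabled for query-only instances); the executor instance count is omitted because no example value is given:

```
# illustrative spark-defaults.conf entries repeating the examples above
spark.sql.shuffle.partitions   32
spark.executor.cores           4
spark.executor.memory          15g
spark.speculation              true
```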

