carbondata-issues mailing list archives

From chenliang613
Subject [GitHub] carbondata pull request #1177: [CARBONDATA-1281] Support multiple temp dirs ...
Date Thu, 20 Jul 2017 15:52:24 GMT
Github user chenliang613 commented on a diff in the pull request:
    --- Diff: docs/ ---
    @@ -231,5 +231,6 @@ scenarios. After the completion of POC, some of the configurations impacting the
     | spark.executor.instances/spark.executor.cores/spark.executor.memory | spark/conf/spark-defaults.conf | Querying | The number of executors, CPU cores, and memory used for CarbonData query. | In the bank scenario, we provide the 4 CPUs cores and 15 GB for each executor which can get good performance. This 2 value does not mean more the better. It needs to be configured properly in case of limited resources. For example, In the bank scenario, it has enough CPU 32 cores each node but less memory 64 GB each node. So we cannot give more CPU but less memory. For example, when 4 cores and 12GB for each executor. It sometimes happens GC during the query which impact the query performance very much from the 3 second to more than 15 seconds. In this scenario need to increase the memory or decrease the CPU cores. |
     | carbon.detail.batch.size | spark/carbonlib/ | Data loading | The buffer size to store records, returned from the block scan. | In limit scenario this parameter is very important. For example your query limit is 1000. But if we set this value to 3000 that means we get 3000 records from scan but spark will only take 1000 rows. So the 2000 remaining are useless. In one Finance test case after we set it to 100, in the limit 1000 scenario the performance increase about 2 times in comparison to if we set this value to 12000. |
     | carbon.use.local.dir | spark/carbonlib/ | Data loading | Whether use YARN local directories for multi-table load disk load balance | If this is set it to true CarbonData will use YARN local directories for multi-table load disk load balance, that will improve the data load performance. |
    +| carbon.use.multiple.temp.dir | spark/carbonlib/ | Data loading | Whether to use multiple YARN local directories during table data loading for disk load balance | After enabling 'carbon.use.local.dir', if this is set to true, CarbonData will use YARN local directories during data load for disk load balance, that will improve the data load performance. Please enable this property especially when you encounter disk hotspot problem during data loading.
    --- End diff --
    So, by default, do you propose to set this to "true" or "false"?

