From: "xuchuanyin (JIRA)"
To: issues@carbondata.apache.org
Date: Wed, 20 Dec 2017 03:43:00 +0000 (UTC)
Subject: [jira] [Commented] (CARBONDATA-1839) Data load failed when using compressed sort temp file

[ https://issues.apache.org/jira/browse/CARBONDATA-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297847#comment-16297847 ]

xuchuanyin commented on CARBONDATA-1839:
----------------------------------------

Recently I found a bug in compressing sort temp files and tried to fix it in PR#1632 (https://github.com/apache/carbondata/pull/1632). With this PR, CarbonData compresses the records batch by batch and writes the compressed content to the file when the feature is turned on.
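For context, the record-batch scheme looks roughly like the following sketch (a hypothetical helper, not the actual PR#1632 code; the header fields correspond to the layout described in point 1 below):

```
import java.io.{ByteArrayOutputStream, DataOutputStream}
import org.xerial.snappy.Snappy

// Sketch of record-batch compression: the whole batch is buffered and then
// compressed in memory before anything reaches the file, so two large
// temporary byte arrays are allocated per batch.
def writeBatch(out: DataOutputStream, batch: Seq[Array[Byte]]): Unit = {
  val buffer = new ByteArrayOutputStream()
  batch.foreach(b => buffer.write(b))                   // one big batch buffer
  val compressed = Snappy.compress(buffer.toByteArray)  // second big temporary array
  out.writeInt(batch.length)                            // batch_entry_number
  out.writeInt(compressed.length)                       // compressed_length
  out.write(compressed)                                 // compressed_content
}
```

These per-batch temporary arrays are the allocations discussed in point 2 below.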
However, I found that the GC performance of this approach is terrible. In my scenario, about half of the time was wasted in GC, and the overall performance was worse than before. I think the problem may lie in compressing the records batch by batch. Instead, I propose to compress the sort temp file at the file level, not at the record-batch level:

1. Compared with uncompressed files, compressing at the record-batch level leads to a different file layout, and it also affects the reading/writing behavior.
   Compressed:   |total_entry_number|batch_entry_number|compressed_length|compressed_content|batch_entry_number|compressed_length|compressed_content|...
   Uncompressed: |total_entry_number|record|record|...
2. While compressing/decompressing a record batch, we have to hold the bytes in temporary memory. If the batch size is big, the buffer goes directly into the JVM old generation, which causes frequent FULL GC. I also tried to reuse this temporary memory, but it is only reusable within one file -- we need to allocate the memory for each file, so if the number of intermediate files is big, frequent FULL GC is still inevitable. If the batch size is small, we need to store more `batch_entry_number` headers (described in point 1).
3. File-level compression simplifies the code: a compressed stream is still a stream, so it does not affect the behavior of reading/writing compressed or uncompressed files (see the sketch after this list).
4. After I switched to file-level compression, the GC problem disappeared. Since my cluster has crashed, I could not measure the actual performance improvement, but judging from the CarbonData maven tests, the most time-consuming module, `Spark Common Test`, takes less time to complete than with uncompressed files.
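To make point 3 concrete, here is a minimal sketch of file-level compression, assuming the Snappy stream classes from the xerial snappy-java library as the compressor (this is not CarbonData's actual writer code):

```
import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream,
  DataOutputStream, File, FileInputStream, FileOutputStream, InputStream, OutputStream}
import org.xerial.snappy.{SnappyInputStream, SnappyOutputStream}

// A compressed stream is still a stream: wrap the raw file stream once at
// open time, and every subsequent read/write call is identical for
// compressed and uncompressed sort temp files.
def openWriter(file: File, compress: Boolean): DataOutputStream = {
  val raw: OutputStream = new BufferedOutputStream(new FileOutputStream(file))
  new DataOutputStream(if (compress) new SnappyOutputStream(raw) else raw)
}

def openReader(file: File, compress: Boolean): DataInputStream = {
  val raw: InputStream = new BufferedInputStream(new FileInputStream(file))
  new DataInputStream(if (compress) new SnappyInputStream(raw) else raw)
}
```

Because the compressing stream works on its own fixed-size internal blocks, no whole-batch buffer is ever materialized, which is why the old-generation pressure from point 2 goes away.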
Time consumed in the `Spark Common Test` module:

| Compressor | Time Consumed |
| --- | --- |
| None | 19:25 min |
| SNAPPY | 18:38 min |
| LZ4 | 19:12 min |
| GZIP | 20:32 min |
| BZIP2 | 21:10 min |

In conclusion, I think file-level compression is better, and I plan to remove the record-batch level compression code from CarbonData.

> Data load failed when using compressed sort temp file
> -----------------------------------------------------
>
>                 Key: CARBONDATA-1839
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1839
>             Project: CarbonData
>          Issue Type: Bug
>            Reporter: xuchuanyin
>            Assignee: xuchuanyin
>          Time Spent: 10h 10m
>  Remaining Estimate: 0h
>
> CarbonData provides an option to optimize the data load process by compressing the intermediate sort temp files.
> The option is `carbon.is.sort.temp.file.compression.enabled` and its default value is `false`. In disk-constrained scenarios, the user can turn this feature on by setting the option to `true`; the file content is then compressed before being written to disk.
> However, I have found bugs in the related code, and data loading failed after turning on this feature.
> The bug can be reproduced easily. I used the example from `TestLoadDataFrame` line 98:
> 1. create a dataframe (e.g. 320000 rows with 3 columns)
> 2. set carbon.is.sort.temp.file.compression.enabled=true in CarbonProperties
> 3. write the dataframe to a carbon table through DataFrameWriter
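The three quoted steps correspond roughly to the following illustrative Scala sketch (the table name and writer options here are hypothetical and may differ between CarbonData versions; this is not the exact test code):

```
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.carbondata.core.util.CarbonProperties

val spark = SparkSession.builder().appName("CARBONDATA-1839-repro").getOrCreate()
import spark.implicits._

// 1. create a dataframe, e.g. 320000 rows with 3 columns
val df = spark.sparkContext
  .parallelize(1 to 320000)
  .map(i => ("c1_" + i, "c2", i))
  .toDF("c1", "c2", "c3")

// 2. enable sort temp file compression before loading
CarbonProperties.getInstance()
  .addProperty("carbon.is.sort.temp.file.compression.enabled", "true")

// 3. write the dataframe to a carbon table through DataFrameWriter;
//    with the option enabled, this load fails with the errors quoted below
df.write
  .format("carbondata")
  .option("tableName", "carbon_df_table")
  .mode(SaveMode.Overwrite)
  .save()
```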
> Error messages are shown as below:
> ```
> 17/11/29 18:04:12 ERROR SortDataRows: SortDataRowPool:test1
> java.lang.ClassCastException: [B cannot be cast to [Ljava.lang.Integer;
>     at org.apache.carbondata.core.util.NonDictionaryUtil.getDimension(NonDictionaryUtil.java:93)
>     at org.apache.carbondata.processing.sort.sortdata.UnCompressedTempSortFileWriter.writeDataOutputStream(UnCompressedTempSortFileWriter.java:52)
>     at org.apache.carbondata.processing.sort.sortdata.CompressedTempSortFileWriter.writeSortTempFile(CompressedTempSortFileWriter.java:65)
>     at org.apache.carbondata.processing.sort.sortdata.SortTempFileChunkWriter.writeSortTempFile(SortTempFileChunkWriter.java:72)
>     at org.apache.carbondata.processing.sort.sortdata.SortDataRows.writeSortTempFile(SortDataRows.java:245)
>     at org.apache.carbondata.processing.sort.sortdata.SortDataRows.writeDataTofile(SortDataRows.java:232)
>     at org.apache.carbondata.processing.sort.sortdata.SortDataRows.access$300(SortDataRows.java:45)
>     at org.apache.carbondata.processing.sort.sortdata.SortDataRows$DataSorterAndWriter.run(SortDataRows.java:426)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> ```
> ```
> 17/11/29 18:04:13 ERROR SortDataRows: SafeParallelSorterPool:test1 exception occurred while trying to acquire a semaphore lock: Task org.apache.carbondata.processing.sort.sortdata.SortDataRows$DataSorterAndWriter@3d413b40 rejected from java.util.concurrent.ThreadPoolExecutor@cb56011[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
> 17/11/29 18:04:13 ERROR ParallelReadMergeSorterImpl: SafeParallelSorterPool:test1
> org.apache.carbondata.processing.sort.exception.CarbonSortKeyAndGroupByException:
>     at org.apache.carbondata.processing.sort.sortdata.SortDataRows.addRowBatch(SortDataRows.java:173)
>     at org.apache.carbondata.processing.loading.sort.impl.ParallelReadMergeSorterImpl$SortIteratorThread.run(ParallelReadMergeSorterImpl.java:227)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.RejectedExecutionException: Task org.apache.carbondata.processing.sort.sortdata.SortDataRows$DataSorterAndWriter@3d413b40 rejected from java.util.concurrent.ThreadPoolExecutor@cb56011[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
>     at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>     at org.apache.carbondata.processing.sort.sortdata.SortDataRows.addRowBatch(SortDataRows.java:169)
>     ... 4 more
> ```
> ```
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.carbondata.processing.sort.exception.CarbonSortKeyAndGroupByException:
>     at org.apache.carbondata.processing.sort.sortdata.SortDataRows.addRowBatch(SortDataRows.java:173)
>     at org.apache.carbondata.processing.loading.sort.impl.ParallelReadMergeSorterImpl$SortIteratorThread.run(ParallelReadMergeSorterImpl.java:227)
>     ... 3 more
> Caused by: java.util.concurrent.RejectedExecutionException: Task org.apache.carbondata.processing.sort.sortdata.SortDataRows$DataSorterAndWriter@3d413b40 rejected from java.util.concurrent.ThreadPoolExecutor@cb56011[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
>     at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>     at org.apache.carbondata.processing.sort.sortdata.SortDataRows.addRowBatch(SortDataRows.java:169)
>     ... 4 more
> ```

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)