Mailing-List: contact issues-help@carbondata.apache.org; run by ezmlm
From: xuchuanyin
To: issues@carbondata.apache.org
Reply-To: issues@carbondata.apache.org
Subject: [GitHub] carbondata pull request #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compr...
Content-Type: text/plain
Date: Thu, 7 Dec 2017 12:16:08 +0000 (UTC)
archived-at: Thu, 07 Dec 2017 12:16:11 -0000

GitHub user xuchuanyin opened a pull request:

    https://github.com/apache/carbondata/pull/1632

[CARBONDATA-1839] [DataLoad]Fix bugs in compressing sort temp files

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

- [X] Any interfaces changed?
`YES, ONLY CHANGE INTERNAL INTERFACES`
- [X] Any backward compatibility impacted?
`NO`
- [X] Document update required?
`YES`
- [X] Testing done
  Please provide details on
  - Whether new unit test cases have been added or why no new tests are required?
    `ADDED TESTS`
  - How it is tested? Please attach test report.
    `TESTED IN LOCAL CLUSTER`
  - Is it a performance related change? Please attach the performance test report.
    `YES`
  - Any additional information to help reviewers in testing this change.
    `Some duplicate code in writing sort temp files was found while fixing this bug; I plan to clean it up in a follow-up PR, not in this one.`
- [X] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
`NOT RELATED`

RESOLVE
===

1. Fix bugs in compressing sort temp files
2. Reduce duplicate code in reading & writing sort temp files and make it more readable
3. Optimize the sort procedure:

Before:

```flow
st=>start: raw row that has been converted(call it 'RawRow' for short)
e=>end: write 'PartedRow' to DataFile in write procedure
op1=>operation: read RawRow from temp sort file
op2=>operation: sort on RawRow
op3=>operation: write RawRow to temp sort file
cond=>condition: final sort?
op4=>operation: sort on RawRow
op5=>operation: convert each RawRow to 3 'PartedRow'

st->op1->op2->op3->cond
cond(no)->op1
cond(yes)->op4->op5->e
```

After:

```flow
st=>start: raw row that has been converted(call it 'RawRow' for short)
e=>end: write 'PartedRow' to DataFile in write procedure
op1=>operation: convert RawRow to 3 'PartedRow'
op2=>operation: read PartedRow from temp sort file
op3=>operation: sort on PartedRow
op4=>operation: write PartedRow to temp sort file
cond=>condition: final sort?
op5=>operation: sort on PartedRow

st->op1->op2->op3->op4->cond
cond(no)->op2
cond(yes)->op5->e
```

4. Add tests that enable sort_temp_file_compressed during data loading

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata bug_sort_temp_compress_1207

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1632.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1632

----
commit fb46e1288ae3150700a6508298f1ec9dcc8d37c2
Author: xuchuanyin
Date:   2017-12-07T08:31:58Z

    Fix bugs in compressing sort temp file

    1. fix bugs in compressing sort temp file
    2. reduce duplicate code in reading & writing sort temp file and make it more readable
    3. optimize sort procedure:

       Before: raw row that has been converted(call it 'RawRow' for short)
       -> sort on RawRow -> write RawRow to temp sort file
       -> read RawRow from temp sort file -> sort on RawRow -> ...
       -> at the final sort, sort on RawRow and convert the RawRow to 3 'PartedRow'
       -> write 'PartedRow' to DataFile in write procedure.

       After: raw row that has been converted(call it 'RawRow' for short)
       -> convert RawRow to 3 'PartedRow' -> sort on PartedRow
       -> write PartedRow to temp sort file
       -> read PartedRow from temp sort file -> sort on PartedRow -> ...
       -> at the final sort, sort on PartedRow
       -> write 'PartedRow' to DataFile in write procedure.

    4. add tests

----

---
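
For readers who want to see the "After" ordering in concrete form, here is a minimal Python sketch of it. It is an illustration only and not the code in this pull request: the `PartedRow` field names, the index-based column selection, the use of gzip and pickle for the temp files, and the `chunk_size` / `fan_in` parameters are all assumptions made for the example. The point it shows is that each row is converted once, up front, and every later pass (sort, write, read, merge) works on the already converted form.

```python
import gzip
import heapq
import os
import pickle
import tempfile
from typing import Iterable, Iterator, List, NamedTuple, Optional, Sequence, Tuple


class PartedRow(NamedTuple):
    """Hypothetical 3-part row; the real split in CarbonData may differ."""
    sort_dims: Tuple      # columns the sort is keyed on
    no_sort_dims: Tuple   # remaining dimension columns
    measures: Tuple       # measure columns


def sort_key(row: PartedRow) -> Tuple:
    return row.sort_dims


def to_parted_row(raw: Sequence, sort_idx, no_sort_idx, measure_idx) -> PartedRow:
    """Convert a raw row once, before any sorting or temp-file I/O."""
    return PartedRow(
        tuple(raw[i] for i in sort_idx),
        tuple(raw[i] for i in no_sort_idx),
        tuple(raw[i] for i in measure_idx),
    )


def write_rows(rows: Iterable[PartedRow], tmp_dir: Optional[str]) -> str:
    """Write already-sorted rows to a compressed temp file (gzip is an assumption)."""
    fd, path = tempfile.mkstemp(suffix=".sorttemp.gz", dir=tmp_dir)
    os.close(fd)
    with gzip.open(path, "wb") as out:
        for row in rows:
            pickle.dump(tuple(row), out)  # store as a plain tuple of three tuples
    return path


def read_rows(path: str) -> Iterator[PartedRow]:
    """Stream rows back from a compressed temp file in their stored order."""
    with gzip.open(path, "rb") as src:
        while True:
            try:
                yield PartedRow(*pickle.load(src))
            except EOFError:
                break


def external_sort(raw_rows, sort_idx, no_sort_idx, measure_idx,
                  chunk_size=2, fan_in=2, tmp_dir=None) -> List[PartedRow]:
    """Sort rows via compressed temp files; fan_in must be at least 2."""
    paths: List[str] = []
    chunk: List[PartedRow] = []
    for raw in raw_rows:
        chunk.append(to_parted_row(raw, sort_idx, no_sort_idx, measure_idx))
        if len(chunk) == chunk_size:
            paths.append(write_rows(sorted(chunk, key=sort_key), tmp_dir))
            chunk = []
    if chunk:
        paths.append(write_rows(sorted(chunk, key=sort_key), tmp_dir))

    # Intermediate passes: read PartedRow -> merge -> write PartedRow.
    while len(paths) > fan_in:
        group, paths = paths[:fan_in], paths[fan_in:]
        merged = heapq.merge(*(read_rows(p) for p in group), key=sort_key)
        paths.append(write_rows(merged, tmp_dir))
        for p in group:
            os.remove(p)

    # Final pass: merge the remaining files; no conversion is needed here.
    result = list(heapq.merge(*(read_rows(p) for p in paths), key=sort_key))
    for p in paths:
        os.remove(p)
    return result


if __name__ == "__main__":
    data = [(3, "c", 30.0), (1, "a", 10.0), (2, "b", 20.0)]
    for row in external_sort(data, [0], [1], [2]):
        print(row)
```

In the "Before" ordering, the call to `to_parted_row` would sit in the final pass instead, so every intermediate temp file would hold raw rows and the conversion cost would land at the end of the load.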