carbondata-issues mailing list archives

From xuchuanyin <...@git.apache.org>
Subject [GitHub] carbondata pull request #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compr...
Date Thu, 07 Dec 2017 12:16:08 GMT
GitHub user xuchuanyin opened a pull request:

    https://github.com/apache/carbondata/pull/1632

    [CARBONDATA-1839] [DataLoad]Fix bugs in compressing sort temp files

    Be sure to do all of the following checklist to help us incorporate 
    your contribution quickly and easily:
    
     - [X] Any interfaces changed?
      `YES, ONLY CHANGE INTERNAL INTERFACES`
     - [X] Any backward compatibility impacted?
      `NO`
     - [X] Document update required?
      `YES`
     - [X] Testing done
            Please provide details on 
            - Whether new unit test cases have been added or why no new tests are required?
            `ADDED TESTS`
            - How it is tested? Please attach test report.
            `TESTED IN LOCAL CLUSTER`
            - Is it a performance related change? Please attach the performance test report.
            `YES`
            - Any additional information to help reviewers in testing this change.
            `Some duplicate code in writing sort temp files was found during this bug fix; I plan to optimize it in a subsequent PR, not in this one.`
     - [X] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
            `NOT RELATED`
    
    RESOLVE
    ===
    
    1. Fix bugs in compressing sort temp file
    
    2. Reduce duplicate code in reading & writing sort temp file
     and make it more readable
    
    3. Optimize sort procedure:
    
    Before:
    
    ```flow
    st=>start: raw row that has been converted(call it 'RawRow' for short)
    e=>end: write 'PartedRow' to DataFile in write procedure
    op1=>operation: read RawRow from temp sort file
    op2=>operation: sort on RawRow
    op3=>operation: write RawRow to temp sort file
    cond=>condition: final sort?
    op4=>operation: sort on RawRow
    op5=>operation: convert each RawRow to 3 'PartedRow'
    
    st->op1->op2->op3->cond
    cond(no)->op1
    cond(yes)->op4->op5->e
    ```
    After:
    
    ```flow
    st=>start: raw row that has been converted(call it 'RawRow' for short)
    e=>end: write 'PartedRow' to DataFile in write procedure
    op1=>operation: convert RawRow to 3 'PartedRow'
    op2=>operation: read PartedRow from temp sort file
    op3=>operation: sort on PartedRow
    op4=>operation: write PartedRow to temp sort file
    cond=>condition: final sort?
    op5=>operation: sort on PartedRow
    
    st->op1->op2->op3->op4->cond
    cond(no)->op2
    cond(yes)->op5->e
    ```
    
    4. Add tests that enable sort_temp_file_compressed during data loading
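
The "After" flow in item 3 can be illustrated with a minimal, self-contained sketch. It is not CarbonData's code: the class `SortTempFileSketch`, the `PartedRow` stand-in, the assumed `key,payload` raw-row format, and the use of GZIP from `java.util.zip` as the temp-file compressor are all hypothetical choices made for illustration. The point it shows is that each raw row is converted once, before the first sort, so every compressed temp file written and read afterwards already holds the converted form.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class SortTempFileSketch {
    /** Hypothetical stand-in for a 'PartedRow': a sort key plus the rest of the row. */
    static final class PartedRow {
        final int sortKey;
        final String payload;
        PartedRow(int sortKey, String payload) { this.sortKey = sortKey; this.payload = payload; }
    }

    /** Converts a raw row once, up front. The "key,payload" format is an assumption. */
    static PartedRow convert(String rawRow) {
        int comma = rawRow.indexOf(',');
        return new PartedRow(Integer.parseInt(rawRow.substring(0, comma)), rawRow.substring(comma + 1));
    }

    /** Sorts one in-memory batch and writes it to a GZIP-compressed temp file. */
    static Path writeSortedRun(List<PartedRow> batch) throws IOException {
        List<PartedRow> sorted = new ArrayList<>(batch);
        sorted.sort(Comparator.comparingInt((PartedRow r) -> r.sortKey));
        Path file = Files.createTempFile("sorttemp", ".gz");
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(sorted.size());
            for (PartedRow row : sorted) {
                out.writeInt(row.sortKey);
                out.writeUTF(row.payload);
            }
        }
        return file;
    }

    /** Reads a compressed temp file back; rows are already in converted form. */
    static List<PartedRow> readRun(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(Files.newInputStream(file)))) {
            int count = in.readInt();
            List<PartedRow> rows = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                int key = in.readInt();
                rows.add(new PartedRow(key, in.readUTF()));
            }
            return rows;
        }
    }

    /** Final sort: k-way merge of the sorted runs, with no conversion step left to do. */
    static List<PartedRow> mergeRuns(List<Path> runs) throws IOException {
        List<Iterator<PartedRow>> sources = new ArrayList<>();
        for (Path run : runs) sources.add(readRun(run).iterator());
        PriorityQueue<Map.Entry<PartedRow, Integer>> heap = new PriorityQueue<>(
            Comparator.comparingInt((Map.Entry<PartedRow, Integer> e) -> e.getKey().sortKey));
        for (int i = 0; i < sources.size(); i++) {
            if (sources.get(i).hasNext()) heap.add(new AbstractMap.SimpleEntry<>(sources.get(i).next(), i));
        }
        List<PartedRow> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Map.Entry<PartedRow, Integer> head = heap.poll();
            merged.add(head.getKey());
            Iterator<PartedRow> source = sources.get(head.getValue());
            if (source.hasNext()) heap.add(new AbstractMap.SimpleEntry<>(source.next(), head.getValue()));
        }
        return merged;
    }

    /** Convert -> sort -> write compressed temp file -> read -> final merge. */
    static List<String> sortRows(List<String> rawRows, int batchSize) {
        try {
            List<Path> runs = new ArrayList<>();
            List<PartedRow> batch = new ArrayList<>();
            for (String raw : rawRows) {
                batch.add(convert(raw)); // conversion happens here, before any sorting
                if (batch.size() == batchSize) {
                    runs.add(writeSortedRun(batch));
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) runs.add(writeSortedRun(batch));
            List<String> result = new ArrayList<>();
            for (PartedRow row : mergeRuns(runs)) result.add(row.sortKey + ":" + row.payload);
            for (Path run : runs) Files.delete(run); // clean up temp files
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sortRows(Arrays.asList("3,c", "1,a", "2,b", "0,z"), 2));
    }
}
```

With a batch size of 2, the four sample rows are spilled as two compressed sorted runs and then merged. In the "Before" flow, `convert` would instead be called inside the final merge, and the temp files would hold raw rows.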

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata bug_sort_temp_compress_1207

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1632.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1632
    
----
commit fb46e1288ae3150700a6508298f1ec9dcc8d37c2
Author: xuchuanyin <xuchuanyin@huawei.com>
Date:   2017-12-07T08:31:58Z

    Fix bugs in compressing sort temp file
    
    1. fix bugs in compressing sort temp file
    
    2. reduce duplicate code in reading & writing sort temp file
     and make it more readable
    
    3. optimize sort procedure:
    
    Before:
     raw row that has been converted(call it 'RawRow' for short) ->
     sort on RawRow ->
     write RawRow to temp sort file ->
     read RawRow from temp sort file ->
     sort on RawRow -> ... ->
     at the final sort, sort on RawRow and convert the RawRow to 3 'PartedRow' ->
     write 'PartedRow' to DataFile in write procedure.
    
    After:
     raw row that has been converted(call it 'RawRow' for short) ->
     convert RawRow to 3 'PartedRow' ->
     sort on PartedRow ->
     write PartedRow to temp sort file ->
     read PartedRow from temp sort file ->
     sort on PartedRow -> ... ->
     at the final sort, sort on PartedRow ->
     write 'PartedRow' to DataFile in write procedure.
    
    4. add tests

----

