carbondata-issues mailing list archives

From xuchuanyin <...@git.apache.org>
Subject [GitHub] carbondata pull request #1253: [CARBONDATA-1373] Enhance update performance ...
Date Fri, 11 Aug 2017 15:39:43 GMT
GitHub user xuchuanyin opened a pull request:

    https://github.com/apache/carbondata/pull/1253

    [CARBONDATA-1373] Enhance update performance by increasing parallelism

    # Scenario
    
    I recently tested the update feature in CarbonData and found that it performs poorly.
    
    The table contained about 14 million records and about 370 columns (no dictionary
columns), and its data files totaled about 3.8 GB. All the data files were in one segment.
    
    I ran an update statement that modifies one column for every record. The SQL looked
like `UPDATE myTable SET (col1)=(col1+1000) WHERE TRUE`. In my environment, the update job
failed with 'executor lost' errors, and I found 'spill data' related messages in the container
logs.
    
    # Analysis
    
    I've read about the implementation of update-delete in CarbonData in ISSUE#440. An update
consists of a delete operation and an insert operation, and the error occurred during the insert.
    
    After studying the code, I found that during the insert the updated records
are grouped by `segmentId`, which means all the records in one segment are processed
by a single task. That task fails when the amount of input data is large.
    
    # Solution
    
    We should increase the parallelism when updating a segment.
    
    I append a random key to the `segmentId` to increase the number of partitions before
the insert stage, and then remove the suffix when doing the actual insert.
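
    The idea can be illustrated with a minimal, Spark-free sketch in plain Python. It is not
the code in this pull request: the `#` separator, the helper names, and the `parallelism`
value are all hypothetical and chosen only for illustration. It shows that grouping by
`segmentId` alone yields one group for a single segment, while grouping by the salted key
yields several, and that stripping the suffix recovers the original `segmentId`.

```python
import random
from collections import defaultdict

# Hypothetical delimiter; assumed never to occur inside a segmentId.
SEPARATOR = "#"


def salt_key(segment_id, parallelism, rng):
    """Append a random suffix so one segment spreads over several groups."""
    return f"{segment_id}{SEPARATOR}{rng.randrange(parallelism)}"


def unsalt_key(salted_key):
    """Strip the suffix to recover the original segmentId before the insert."""
    return salted_key.rsplit(SEPARATOR, 1)[0]


def group_by_key(records, key_fn):
    """Group records by key_fn(record); each group stands in for one task."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return dict(groups)


rng = random.Random(0)  # seeded only to make the sketch repeatable
parallelism = 4         # stand-in for the configurable parallelism
# (segmentId, row) pairs, all belonging to the single segment "0".
records = [("0", row_id) for row_id in range(1000)]

plain = group_by_key(records, lambda r: r[0])
salted = group_by_key(records, lambda r: salt_key(r[0], parallelism, rng))

print("groups without salting:", len(plain))
print("groups with salting:", len(salted))
print("segment ids after unsalting:", {unsalt_key(k) for k in salted})
```

    Because the suffix is random, the records of a segment are spread roughly evenly, but the
number of non-empty groups is only bounded above by the parallelism value.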
    
    # Modification
    
    + Increase parallelism while processing one segment in update
    + Add a property to configure the parallelism
    + Clean up local files after update (fixes a pre-existing bug)
    
    # Notes
    
    I tested the change with the scenario above and the job finished successfully in about
13 minutes. The records were updated as expected.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata enhance_update

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1253.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1253
    
----
commit 84a4fe5e9afcd127cf976cda55bc6bfd8faf4e32
Author: xuchuanyin <xuchuanyin@hust.edu.cn>
Date:   2017-08-11T15:00:20Z

    Enhance update performance by increasing parallelism
    
    + Increase parallelism while processing one segment in update
    + Add a property to configure the parallelism
    + Clean up local files after update (previous bugs)

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
