carbondata-issues mailing list archives

From "xuchuanyin (JIRA)" <>
Subject [jira] [Created] (CARBONDATA-1373) Enhance update performance in carbondata
Date Fri, 11 Aug 2017 12:53:00 GMT
xuchuanyin created CARBONDATA-1373:

             Summary: Enhance update performance in carbondata
                 Key: CARBONDATA-1373
             Project: CarbonData
          Issue Type: Improvement
          Components: data-load
            Reporter: xuchuanyin
            Assignee: xuchuanyin
             Fix For: 1.2.0

# Scenario

I recently tested the update feature in CarbonData and found that it performs poorly.

I had a table containing about 14 million records with about 370 columns (no dictionary columns),
and the data files were about 3.8 GB in total. All the data files were in one segment.

I ran an update statement that modifies one column for every record. The SQL looked like
`UPDATE myTable SET (col1)=(col1+1000) WHERE TRUE`. In my environment, the update job failed
with 'executor lost' errors, and I found 'spill data' related messages in the container logs.

# Analysis
I've read about the implementation of update-delete in CarbonData in ISSUE#440. An update
consists of a delete operation and an insert operation, and the error occurred during the insert.

After studying the code, I found that during the insert, the updated records are
grouped by `segmentId`. This means that all the records in one segment are processed
by a single task, which causes the task to fail when the amount of input data is large.
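
To illustrate the effect, here is a minimal Python sketch. It is not CarbonData code: the row
layout and the helper name `group_by_key` are assumptions made for illustration, and each group
simply stands in for the input of one task.

```python
from collections import defaultdict


def group_by_key(rows, key_fn):
    """Group rows by key; each resulting group models the input of one task."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    return dict(groups)


# Hypothetical data: every row belongs to the same segment, as in the scenario.
rows = [{"segment_id": "0", "col1": i} for i in range(1000)]
groups = group_by_key(rows, lambda row: row["segment_id"])

print(len(groups))       # how many tasks receive any data
print(len(groups["0"]))  # how many rows the task for segment "0" must handle
```

With a single distinct key there is a single group, so the whole segment lands on one task
regardless of how many executors are available.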

# Solution
We should improve the parallelism when updating a segment.

I append a random key to the `segmentId` to increase the number of partitions before the
insertion stage, and then remove the suffix when doing the actual insertion.
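
The idea can be sketched in plain Python as follows. This is only an illustration of key
salting under stated assumptions, not the actual patch: the `#` separator, the number of salts,
and the helper names `salt_key`, `unsalt_key` and `partition_by_salted_key` are all hypothetical.

```python
import random
from collections import defaultdict

SEPARATOR = "#"  # hypothetical delimiter; assumed never to occur in a segment id


def salt_key(segment_id, num_salts, rng):
    """Append a random suffix so one segment spreads over several keys."""
    return f"{segment_id}{SEPARATOR}{rng.randrange(num_salts)}"


def unsalt_key(salted_key):
    """Strip the random suffix to recover the original segment id."""
    return salted_key.rsplit(SEPARATOR, 1)[0]


def partition_by_salted_key(rows, num_salts, seed=0):
    """Group rows by salted key; each group models the input of one task."""
    rng = random.Random(seed)
    partitions = defaultdict(list)
    for row in rows:
        partitions[salt_key(row["segment_id"], num_salts, rng)].append(row)
    return dict(partitions)


# Hypothetical data: a single segment, as in the scenario.
rows = [{"segment_id": "0", "col1": i} for i in range(1000)]
partitions = partition_by_salted_key(rows, num_salts=8)

# Each partition can be written by its own task; the suffix is removed
# so that the rows are still inserted into the original segment.
for salted_key, part in partitions.items():
    print(salted_key, "->", unsalt_key(salted_key), len(part))
```

The number of salts bounds the extra parallelism per segment, so it would need to be chosen
with the data volume and the available executors in mind.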

I tested this on my example, and the job finished successfully in about 13 minutes. The records
were updated as expected.

