carbondata-issues mailing list archives

From "xuchuanyin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CARBONDATA-1281) Disk hotspot found during data loading
Date Mon, 10 Jul 2017 06:12:01 GMT

     [ https://issues.apache.org/jira/browse/CARBONDATA-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

xuchuanyin updated CARBONDATA-1281:
-----------------------------------
    Description: 
# Scenario

We recently performed a massive data loading. The input data is about 71 GB in CSV format and
contains about 88 million records. When using CarbonData, we do not use any dictionary encoding.
Our testing environment has three nodes, each with 11 disks configured as YARN executor directories.
We submit the loading command through JDBCServer. The JDBCServer instance has three executors
in total, one on each node. The loading takes about 10 minutes (varying by about ±3 minutes
between runs).

We observed nmon statistics during the loading and found:

1. lots of CPU wait (I/O wait) in the first half of the loading;

2. only a single disk has heavy writes and almost reaches its bottleneck (avg. 80 MB/s, max.
150 MB/s on a SAS disk);

3. the other disks are almost idle.

# Analysis

During data loading, CarbonData reads and sorts data locally (the default sort scope) and writes
the temporary files to local disk. In my case, there is only one executor per node, so CarbonData
writes all the temporary files to a single disk (the container directory, i.e. one YARN local
directory), resulting in a single-disk hotspot.
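The single-disk behaviour described above can be sketched as follows (a hypothetical illustration, not CarbonData's actual code; the directory path is made up): with one executor per node and a single container directory, every sort-temp file resolves under the same mount point, so one disk absorbs all the intermediate writes.

```java
import java.io.File;

// Hedged sketch (not CarbonData's real code): every temp file path is
// built from the lone YARN local directory assigned to this executor,
// so all intermediate sort writes hit the same physical disk.
public class SingleDirTempFiles {
    // Hypothetical container/YARN local directory for this executor.
    static final String LOCAL_DIR = "/data1/yarn/usercache/app_01/tmp";

    // Resolve the path of the n-th sort-temp file; the parent directory
    // (and hence the disk) is always the same.
    static String tempFilePath(int fileNo) {
        return new File(LOCAL_DIR, "sorttemp_" + fileNo + ".sort").getPath();
    }

    public static void main(String[] args) {
        // All temp files share one parent directory -> one disk hotspot.
        for (int n = 0; n < 3; n++) {
            System.out.println(tempFilePath(n));
        }
    }
}
```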

# Modification

We should support multiple directories for writing temp files, to avoid the disk hotspot.

P.S.: I have implemented this improvement in my environment and the result is promising: the loading
takes about 6 minutes (down from 10 minutes before the change).
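One way to sketch the proposed fix (an assumption-laden illustration, not the actual patch): spread the sort-temp files round-robin across a configured list of directories, one per disk, so writes are balanced across all spindles.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch of multi-directory temp-file allocation (class and
// method names are hypothetical, not CarbonData's API): successive
// temp files are assigned to directories in round-robin order.
public class TempDirChooser {
    private final String[] dirs;                  // e.g. one entry per disk
    private final AtomicInteger counter = new AtomicInteger(0);

    public TempDirChooser(String[] dirs) {
        if (dirs == null || dirs.length == 0) {
            throw new IllegalArgumentException("need at least one temp dir");
        }
        this.dirs = dirs.clone();
    }

    /** Pick the next directory in round-robin order (thread-safe). */
    public String nextDir() {
        int i = Math.floorMod(counter.getAndIncrement(), dirs.length);
        return dirs[i];
    }

    public static void main(String[] args) {
        TempDirChooser chooser = new TempDirChooser(
                new String[] {"/data1/tmp", "/data2/tmp", "/data3/tmp"});
        // Successive temp files now land on different disks.
        for (int n = 0; n < 6; n++) {
            System.out.println(chooser.nextDir());
        }
    }
}
```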



> Disk hotspot found during data loading
> --------------------------------------
>
>                 Key: CARBONDATA-1281
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1281
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: core, data-load
>    Affects Versions: 1.1.0
>            Reporter: xuchuanyin



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
