cassandra-commits mailing list archives

From "xiangdong Huang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-13446) CQLSSTableWriter takes 100% CPU when the buffer_size_in_mb is larger than 64MB
Date Thu, 13 Apr 2017 13:50:41 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xiangdong Huang updated CASSANDRA-13446:
----------------------------------------
    Description: 
I want to use CQLSSTableWriter to bulk-load a large amount of data as SSTables; however, the CPU cost and the throughput are not good.
```java
import java.io.File;

import org.apache.cassandra.dht.Murmur3Partitioner;
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

// ...
CQLSSTableWriter writer = CQLSSTableWriter.builder()
        .inDirectory(new File("output" + j))
        .forTable(SCHEMA)
        // FIXME!! if the size is 64 it is OK; if it is 128 or larger, boom!!
        .withBufferSizeInMB(Integer.parseInt(System.getProperty("buffer_size_in_mb", "256")))
        .using(INSERT_STMT)
        .withPartitioner(new Murmur3Partitioner())
        .build();
```
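
For completeness, here is a minimal sketch of how I drive the writer (the schema, statement, and row values below are simplified placeholders, not the exact ones from the attached program): each addRow() call buffers the row in memory, and the writer begins flushing a new SSTable once roughly `buffer_size_in_mb` of data has accumulated.

```java
import java.io.File;

import org.apache.cassandra.dht.Murmur3Partitioner;
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public class WriterSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder schema and statement, for illustration only.
        String schema = "CREATE TABLE ks.t (id bigint PRIMARY KEY, payload text)";
        String insert = "INSERT INTO ks.t (id, payload) VALUES (?, ?)";

        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory(new File("output"))
                .forTable(schema)
                .using(insert)
                .withBufferSizeInMB(64) // the known-good size on my machine
                .withPartitioner(new Murmur3Partitioner())
                .build();

        for (long i = 0; i < 10_000_000L; i++) {
            // Rows accumulate in memory; a flush to a new SSTable starts
            // once about buffer_size_in_mb of data has been added.
            writer.addRow(i, "row-" + i);
        }
        writer.close(); // flushes the final, partially filled buffer
    }
}
```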
If `buffer_size_in_mb` is 64 MB or less on my PC, everything is OK: CPU utilization is about 60% and memory usage is about 3 GB (why 3 GB? Luckily, I can bear that...). The process creates SSTables one by one, each about 24 MB (I think that is because the SSTable compresses the data).
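(If it is a full 64 MB buffer that ends up as a roughly 24 MB file, that would be a compression ratio of about 64/24 ≈ 2.7:1, which seems plausible for Cassandra's default LZ4 compressor.)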

However, if `buffer_size_in_mb` is larger, e.g., 128 MB on my PC, CPU utilization is about 70% and memory usage is still about 3 GB.
When CQLSSTableWriter has received 128 MB of data, it begins to flush the data as an SSTable. At this point the bad thing happens:
CQLSSTableWriter.addRow() becomes very slow, and NO SSTABLE IS WRITTEN. Windows Task Manager shows 0.0 MB/s of disk I/O for the process. No file appears in the output folder (sometimes a zero-KB _mc-1-big-Data.db_ and a zero-KB _mc-1-big-Index.db_ appear, and a transaction log file comes and disappears...). At this point the process uses 99% CPU, and memory grows a little beyond 3 GB.
A long time later, the process crashes with a "GC overhead limit exceeded" error, and still no SSTable file has been built.

When I use JProfiler 10 to check what is consuming so much CPU, it reports that CQLSSTableWriter.addRow() accounts for about 99% of the CPU time.
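
For whoever investigates this: besides a profiler, a cheap way to see where addRow() is spinning is to dump every thread's stack from a separate watchdog thread using the standard java.lang.management API (plain JDK, nothing Cassandra-specific; `jstack <pid>` from the command line gives the same information):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public final class StackDump {
    /** Call this from a watchdog thread while addRow() appears stuck. */
    public static void dumpAllStacks() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            System.err.println("\"" + info.getThreadName() + "\" " + info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.err.println("    at " + frame);
            }
        }
    }
}
```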

I have no idea how to optimize this, because Cassandra's SSTable writing process is quite complex...

The important point is that a 64 MB buffer is too small for production environments: it creates many 24 MB SSTables, whereas we want one large SSTable that can hold all the data from the batch-load process.
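
Until the root cause is found, the only configuration I know to be safe is to clamp the buffer size, as in the sketch below (a workaround based purely on my observations above, not a fix):

```java
// Workaround sketch (based only on my own observations): honor the
// buffer_size_in_mb property, but clamp it to the known-good 64 MB,
// since larger values trigger the hang described above.
static int safeBufferSizeInMB() {
    int requested = Integer.parseInt(System.getProperty("buffer_size_in_mb", "64"));
    return Math.min(requested, 64);
}
```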

Now I wonder whether Spark and MapReduce really work well with Cassandra, because when I glanced at their source code, I noticed that they also use CQLSSTableWriter to save output data.

The Cassandra version is 3.10. The DataStax driver (used for type codecs) is 3.2.0.

The attachments are my test program and the CSV data.
A complete test program can be found at: https://bitbucket.org/jixuan1989/csv2sstable

> CQLSSTableWriter takes 100% CPU when the buffer_size_in_mb is larger than 64MB 
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13446
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13446
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tools
>         Environment: Windows 10, 8GB memory, i7 CPU
>            Reporter: xiangdong Huang
>         Attachments: csv2sstable.java, pom.xml, test.csv
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
