cassandra-user mailing list archives

From Stefania Alborghetti <>
Subject Re: High memory consumption with COPY command
Date Sun, 24 Apr 2016 04:35:46 GMT
That's really excellent! Thank you so much for sharing the results.

Regarding sstableloader, I am not familiar with its performance so I cannot
make any recommendation as I've never compared it with COPY FROM.

I have, however, compared COPY FROM with another bulk import tool,
cassandra-loader <>, during the tests for CASSANDRA-11053. COPY FROM should
now be as efficient as that tool, if not better, depending on the data set
and test environment.

There is also this presentation
<>, from
Cassandra Summit 2015, which compares sstableloader, cassandra-loader
and the "old" COPY FROM. According to the results on slide #18,
sstableloader is slightly better than cassandra-loader for small records,
but its performance decreases as the record size increases.

So my guess is that sstableloader may or may not be better, depending on
the record size. If it is better, I would expect the difference to be
minimal. Sorry this is not very precise, but that's the best I have.

On Sat, Apr 23, 2016 at 6:00 PM, Bhuvan Rawal <> wrote:

> I built with Cython and disabled the bundled driver, and the performance
> has been impressive. The memory issue is resolved and I'm currently getting
> around 100,000 rows per second; it's stressing both the client CPU and the
> Cassandra nodes. That's the fastest I have ever seen it perform, with 60
> million rows already transferred in ~5 minutes.
> Just a final question before we close this thread: at this performance
> level, would you recommend sstableloader or the COPY command?
> On Sat, Apr 23, 2016 at 2:00 PM, Bhuvan Rawal <> wrote:
>> Thanks Stefania for the informative answer.  The next blog was pretty
>> useful as well:
>> . I'll upgrade to 3.0.5, test with C extensions enabled and report on
>> this thread.
>> On Sat, Apr 23, 2016 at 8:54 AM, Stefania Alborghetti <
>>> wrote:
>>> Hi Bhuvan
>>> Support for large datasets in COPY FROM was added by CASSANDRA-11053
>>> <>, which is
>>> available in 2.1.14, 2.2.6, 3.0.5 and 3.5. Your scenario is valid with this
>>> patch applied.
>>> The 3.0.x and 3.x releases are already available, whilst the other two
>>> releases are due in the next few days. You only need to install an
>>> up-to-date release on the machine where COPY FROM is running.
>>> You may find the setup instructions in this blog
>>> <>
>>> interesting. Specifically, for large datasets, I would highly recommend
>>> installing the Python driver with C extensions, as it will speed things up
>>> considerably. Again, this is only possible with the 11053 patch. Please
>>> ignore the suggestion to also compile the cqlsh copy module itself with C
>>> extensions (Cython), as you may hit CASSANDRA-11574
>>> <> in the 3.0.5
>>> and 3.5 releases.
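As a rough sketch, installing the driver with C extensions on the machine running cqlsh could look like the following (the package names, pip flags and the CQLSH_NO_BUNDLED variable are real, but the exact steps depend on your system, so treat this as an outline rather than a recipe):

```
# Cython and a C compiler are needed to build the driver's C extensions.
pip install cython

# Reinstall the Python driver from source so its C extensions get compiled.
pip install --no-binary :all: cassandra-driver

# Tell cqlsh to use the system-installed driver instead of the bundled
# pure-Python one.
export CQLSH_NO_BUNDLED=true
```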
>>> Before CASSANDRA-11053, the parent process was a bottleneck. This is
>>> explained further in this blog
>>> <>,
>>> second paragraph of the "worker processes" section. As a workaround, if you
>>> are unable to upgrade, you may try reducing the INGESTRATE and introducing
>>> a few extra worker processes via NUMPROCESSES. Note also that the
>>> overloaded parent process cannot report progress correctly, so a frozen
>>> progress report does not mean the COPY operation has stopped making
>>> progress.
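For what it's worth, that workaround could look something like the following in cqlsh. The keyspace, table, file path and the specific values here are placeholders; INGESTRATE and NUMPROCESSES are real COPY FROM options:

```sql
COPY myks.mytable FROM '/path/to/data.csv'
WITH INGESTRATE = 50000   -- cap on rows per second fed to the worker processes
 AND NUMPROCESSES = 12;   -- number of worker processes to launch
```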
>>> Do let us know if you still have problems, as this is new functionality.
>>> With best regards,
>>> Stefania
>>> On Sat, Apr 23, 2016 at 6:34 AM, Bhuvan Rawal <>
>>> wrote:
>>>> Hi,
>>>> I'm trying to copy a 20 GB CSV file into a fresh 3-node Cassandra
>>>> cluster with 32 GB of memory each, sufficient disk, RF=1 and durable
>>>> writes disabled. The machine I'm feeding the data from is external to the
>>>> cluster, shares a 1 Gbps line and has 16 GB RAM. (We chose this setup to
>>>> reduce CPU and IO usage on the cluster.)
>>>> I'm using the COPY command to feed in the data. It kicks off well,
>>>> launches a set of processes and does about 50,000 rows per second, but I
>>>> can see that the parent process accumulates memory roughly in proportion
>>>> to the data processed, and after a point the processes just hang. The
>>>> parent process was consuming 95% of system memory when it had processed
>>>> around 60% of the data.
>>>> I had earlier tried feeding in the data from multiple smaller files
>>>> (less than 4 GB each) and it worked as expected.
>>>> Is this a valid scenario?
>>>> Regards,
>>>> Bhuvan
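Since loading from files under 4 GB worked, one stopgap (if upgrading is not an option) is simply to split the big CSV and COPY each chunk in turn. A small sketch, with made-up file and table names and a tiny stand-in file in place of the real 20 GB CSV:

```shell
# Create a tiny stand-in for the real CSV (header plus 4 data rows).
printf 'id,val\n1,a\n2,b\n3,c\n4,d\n' > data.csv

# Keep the header out of the chunks, then split by line count; on the real
# file you would pick a count that keeps each chunk under ~4 GB.
tail -n +2 data.csv > body.csv
split -l 2 -d body.csv chunk_

# Each chunk can then be loaded separately, e.g.:
# for f in chunk_*; do
#   cqlsh -e "COPY myks.mytable (id, val) FROM '$f' WITH HEADER = false;"
# done
ls chunk_*
```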
>>> --
>>> Stefania Alborghetti
>>> Apache Cassandra Software Engineer
>>> |+852 6114 9265|



Stefania Alborghetti

Apache Cassandra Software Engineer

|+852 6114 9265|

