cassandra-user mailing list archives

From Bhuvan Rawal <bhu1ra...@gmail.com>
Subject Re: High Memory consumption with Copy command
Date Sat, 23 Apr 2016 10:00:02 GMT
I built the Cython extensions and disabled the bundled driver, and the
performance has been impressive. The memory issue is resolved and I'm
currently getting around 100,000 rows per second; it's stressing both the
client CPU and the Cassandra nodes. That's the fastest I have ever seen it
perform, with 60 million rows already transferred in ~5 minutes.
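
As a rough cross-check of the figures above, here is a small Python sketch of the arithmetic. It uses only the numbers quoted in this message and treats "~5 minutes" as exactly 300 seconds, which is an assumption; the helper name is made up for illustration.

```python
def rows_per_second(rows, seconds):
    """Average throughput over the whole transfer."""
    return rows / seconds


# Figures quoted in the message: 60 million rows in roughly 5 minutes.
rows_transferred = 60_000_000
elapsed_seconds = 5 * 60  # assumption: "~5 minutes" taken as exactly 300 s

average_rate = rows_per_second(rows_transferred, elapsed_seconds)
print(f"average rate: {average_rate:,.0f} rows/s")
```

With these inputs the average works out to 200,000 rows per second, about twice the instantaneous figure quoted, so at least one of the two numbers should be read as approximate.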

Just a final question before we close this thread: at this performance
level, would you recommend sstableloader or the COPY command?

On Sat, Apr 23, 2016 at 2:00 PM, Bhuvan Rawal <bhu1rawal@gmail.com> wrote:

> Thanks, Stefania, for the informative answer. The next blog post was pretty
> useful as well:
> http://www.datastax.com/dev/blog/how-we-optimized-cassandra-cqlsh-copy-from
> I'll upgrade to 3.0.5, test with C extensions enabled, and report back on
> this thread.
>
> On Sat, Apr 23, 2016 at 8:54 AM, Stefania Alborghetti <
> stefania.alborghetti@datastax.com> wrote:
>
>> Hi Bhuvan
>>
>> Support for large datasets in COPY FROM was added by CASSANDRA-11053
>> <https://issues.apache.org/jira/browse/CASSANDRA-11053>, which is
>> available in 2.1.14, 2.2.6, 3.0.5 and 3.5. Your scenario is valid with this
>> patch applied.
>>
>> The 3.0.x and 3.x releases are already available, whilst the other two
>> releases are due in the next few days. You only need to install an
>> up-to-date release on the machine where COPY FROM is running.
>>
>> You may find the setup instructions in this blog
>> <http://www.datastax.com/dev/blog/six-parameters-affecting-cqlsh-copy-from-performance>
>> interesting. Specifically, for large datasets, I would highly recommend
>> installing the Python driver with C extensions, as it will speed things up
>> considerably. Again, this is only possible with the 11053 patch. Please
>> ignore the suggestion to also compile the cqlsh copy module itself with C
>> extensions (Cython), as you may hit CASSANDRA-11574
>> <https://issues.apache.org/jira/browse/CASSANDRA-11574> in the 3.0.5 and
>> 3.5 releases.
>>
>> Before CASSANDRA-11053, the parent process was a bottleneck. This is
>> explained further in this blog
>> <http://www.datastax.com/dev/blog/how-we-optimized-cassandra-cqlsh-copy-from>,
>> second paragraph in the "worker processes" section. As a workaround, if you
>> are unable to upgrade, you may try reducing the INGESTRATE and introducing
>> a few extra worker processes via NUMPROCESSES. Also, because the parent
>> process is overloaded, it is not able to report progress correctly, so a
>> frozen progress report doesn't mean the COPY operation is not making
>> progress.
>>
>> Do let us know if you still have problems, as this is new functionality.
>>
>> With best regards,
>> Stefania
>>
>>
>> On Sat, Apr 23, 2016 at 6:34 AM, Bhuvan Rawal <bhu1rawal@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to copy a 20 GB CSV file into a fresh 3-node Cassandra cluster
>>> with 32 GB of memory per node, sufficient disk, RF=1 and durable writes
>>> disabled. The machine I'm feeding the data in from is external to the
>>> cluster, shares a 1GBps line and has 16 GB RAM. (We have chosen this
>>> setup to possibly reduce CPU and IO usage.)
>>>
>>> I'm trying to use the COPY command to feed in the data. It kicks off well,
>>> launches a set of processes and does about 50,000 rows per second. But I
>>> can see that the parent process accumulates memory, almost the size of the
>>> data processed so far, and after a point the processes just hang. The
>>> parent process was consuming 95% of system memory when it had processed
>>> around 60% of the data.
>>>
>>> I had earlier tried to feed in data from multiple files (less than 4 GB
>>> each) and it was working as expected.
>>>
>>> Is this a valid scenario?
>>>
>>> Regards,
>>> Bhuvan
>>>
>>
>>
>>
>> --
>>
>>
>>
>> Stefania Alborghetti
>>
>> Apache Cassandra Software Engineer
>>
>> |+852 6114 9265| stefania.alborghetti@datastax.com
>>
>>
>>
>
>
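
For readers who cannot upgrade, the workaround Stefania describes in the quoted message (a lower INGESTRATE plus a few extra worker processes via NUMPROCESSES) amounts to adding options to the COPY FROM statement. The Python sketch below only assembles such a statement as a string; the helper name, table name, file path and numeric values are hypothetical placeholders rather than tuned recommendations, and the resulting statement would still need to be run in cqlsh.

```python
def build_copy_from(table, csv_path, ingest_rate=None, num_processes=None):
    """Return a cqlsh COPY FROM statement with optional throttling options.

    The two option names come from the thread; everything else here is an
    illustrative assumption.
    """
    options = []
    if ingest_rate is not None:
        options.append(f"INGESTRATE={int(ingest_rate)}")
    if num_processes is not None:
        options.append(f"NUMPROCESSES={int(num_processes)}")

    statement = f"COPY {table} FROM '{csv_path}'"
    if options:
        statement += " WITH " + " AND ".join(options)
    return statement + ";"


# Hypothetical table, file and values, chosen only to show the shape.
print(build_copy_from("my_keyspace.my_table", "data.csv",
                      ingest_rate=50000, num_processes=8))
```

The printed statement can be pasted into a cqlsh session; suitable values depend on the client machine and the cluster, so they are best found by experiment.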
