From: Rajesh Radhakrishnan
To: user@cassandra.apache.org
Subject: RE: Cassandra Python Driver : execute_async consumes lots of memory?
Date: Tue, 8 Nov 2016 10:14:51 +0000

Hi Lahiru,

Great! You know what, REDUCTION of BATCH size from 50 to 20 solved my issue.

Thank you very much. Good job, man! The memory issue is solved.
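
For anyone who wants to experiment with the batch size, here is a minimal sketch that turns it into a single parameter. The names `chunked` and `batch_size` are hypothetical and not part of the Cassandra driver; the assumption is that each yielded chunk would be bound and added to one BatchStatement before sending.

```python
from itertools import islice


def chunked(rows, batch_size):
    """Yield successive lists of at most batch_size items taken from rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        # islice pulls up to batch_size items without loading everything at once.
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk
```

Unlike a counter that is reset every N rows, this also yields the final partial chunk, so no trailing rows are left unsent when the total is not a multiple of the batch size.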

Next I will try using Spark to speed it up.


Kind regards,
Rajesh Radhakrishnan 


From: Lahiru Gamathige [lahiru@highfive.com]
Sent: 07 November 2016 17:10
To: user@cassandra.apache.org
Subject: Re: Cassandra Python Driver : execute_async consumes lots of memory?

Hi Rajesh,

Looking at your code, I can see that the memory would definitely grow: you write big batches asynchronously, so you end up with a large number of batch statements in flight, and they all end up slowing down. We recently migrated some data to C*. We created a data stream, wrote it in batches, and used a library that is sensitive to the back-pressure of the stream. In your implementation there is no back-pressure to control it. We migrated the data pretty fast by keeping the CPU at 100% constantly and achieved the highest performance (we used Scala with akka-streams and phantom-websudo).

I would consider using some streaming API to implement this. When you do batching, make sure you don't exceed the max batch size; otherwise things will slow down anyway.
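
To make the back-pressure idea concrete in Python, here is a minimal sketch under stated assumptions, not the Scala/akka-streams setup described above. It caps the number of unfinished requests with a semaphore. The names `submit_with_backpressure`, `submit` and `max_in_flight` are hypothetical, and `submit(item)` is assumed to return an object with `add_done_callback`, as `concurrent.futures.Future` does. With the Python driver, the future returned by `session.execute_async` exposes `add_callbacks(callback, errback)` instead, so a small adapter that releases the slot from both callbacks would be needed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def submit_with_backpressure(submit, items, max_in_flight=32):
    """Submit every item while keeping at most max_in_flight requests unfinished.

    submit(item) must return a future-like object that supports
    add_done_callback(fn), as concurrent.futures.Future does.
    """
    slots = threading.Semaphore(max_in_flight)
    submitted = 0
    for item in items:
        slots.acquire()  # blocks the producer while the window is full
        future = submit(item)
        # Free the slot when the request completes, on success or failure.
        future.add_done_callback(lambda _f: slots.release())
        submitted += 1
    # Drain: take back every slot so we return only after all requests finish.
    for _ in range(max_in_flight):
        slots.acquire()
    return submitted


if __name__ == "__main__":
    # Stand-in for real writes: a thread pool plays the role of the session.
    with ThreadPoolExecutor(max_workers=8) as pool:
        count = submit_with_backpressure(
            lambda n: pool.submit(pow, n, 2), range(1000), max_in_flight=16)
    print(count)
```

The producer loop can never run ahead of the cluster by more than `max_in_flight` requests, so the number of pending statements held in memory stays bounded no matter how many rows are written.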

Lahiru

On Mon, Nov 7, 2016 at 8:51 AM, Rajesh Radhakrishnan <Rajesh.Radhakrishnan@phe.gov.uk> wrote:
Hi

We are trying to insert millions of rows into a table by executing batches of PreparedStatements.

We found that when we use 'session.execute(batch)', it writes more data but is very, very slow.
However, if we use 'session.execute_async(batch)', it is relatively fast, but once it reaches a certain limit it fills up the memory of the Python process.

Our implementation:
Cassandra 3.7.0 cluster ring with 3 nodes (RedHat, 150GB disk, 8GB of RAM each)

Python 2.7.12

Does anybody know how to reduce the memory use of the Cassandra Python driver API, specifically for execute_async? Thank you!



===CODE ======================================
sqlQuery = "INSERT INTO tableV (id, sample_name, pos, ref_base, var_base) values (?,?,?,?,?)"
random_numbers_for_strains = random.sample(xrange(1,300), 200)
random_numbers = random.sample(xrange(1,2000000), 200000)

totalCounter = 0
c = 0
time_init = time.time()
for random_number_strain in random_numbers_for_strains:

    sample_name = None
    sample_name = 'sample'+str(random_number_strain)

    cassandraCluster = CassandraCluster.CassandraCluster()
    cluster = cassandraCluster.create_cluster_with_protocol2()
    session = cluster.connect();
    #session.default_timeout = 1800
    session.set_keyspace(self.KEYSPACE_NAME)

    preparedStatement = session.prepare(sqlQuery)

    counter = 0
    c = c + 1

    for random_number in random_numbers:

        totalCounter += 1
        if counter == 0 :
            batch = BatchStatement()

        counter += 1
        if totalCounter % 10000 == 0 :
            print "Total Count "+ str(totalCounter)

        batch.add(preparedStatement.bind([ uuid.uuid1(), sample_name, random_number, random.choice('GT'), random.choice('AC')]))
        if counter % 50 == 0:
            session.execute_async(batch)
            #session.execute(batch)
            batch = None
            del batch
            counter = 0

    time.sleep(2);
    session.cluster.shutdown()
    random_number= None
    del random_number
    preparedStatement = None
    session = None
    del session
    cluster = None
    del cluster
    cassandraCluster = None
    del cassandraCluster
    gc.collect()

===CODE ======================================



Kind regards,
Rajesh Radhakrishnan


**************************************************************************
The information contained in the EMail and any attachments is confidential and intended solely and for the attention and use of the named addressee(s). It may not be disclosed to any other person without the express authority of Public Health England, or the intended recipient, or both. If you are not the intended recipient, you must not disclose, copy, distribute or retain this message or any part of it. This footnote also confirms that this EMail has been swept for computer viruses by Symantec.Cloud, but please re-sweep any attachments before opening or saving. http://www.gov.uk/PHE
**************************************************************************

