Date: Thu, 5 Jun 2014 14:09:28 -0300
From: Marcelo Elias Del Valle <marcelo@s1mbi0se.com.br>
To: user@cassandra.apache.org
Subject: Re: migration to a new model

Michael,

I will try to test it up to tomorrow and I will let you know all the results.

Thanks a lot!

Best regards,
Marcelo.

2014-06-04 22:28 GMT-03:00 Laing, Michael <michael.laing@nytimes.com>:

> BTW you might want to put a LIMIT clause on your SELECT for testing. -ml
>
> On Wed, Jun 4, 2014 at 6:04 PM, Laing, Michael <michael.laing@nytimes.com> wrote:
>
>> Marcelo,
>>
>> Here is a link to the preview of the python fast copy program:
>>
>> https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47
>>
>> It will copy a table from one cluster to another with some
>> transformation - they can be the same cluster.
>>
>> It has 3 main throttles to experiment with:
>>
>> 1. fetch_size: size of source pages in rows
>> 2. worker_count: number of worker subprocesses
>> 3. concurrency: number of async callback chains per worker subprocess
>>
>> It is easy to overrun Cassandra and the python driver, so I recommend
>> starting with the defaults: fetch_size: 1000; worker_count: 2; concurrency: 10.
>>
>> Additionally there are switches to set 'policies' by source and
>> destination: retry (downgrade consistency), dc_aware, and token_aware.
>> retry is useful if you are getting timeouts. For the others YMMV.
>>
>> To use it you need to define the SELECT and UPDATE cql statements as well
>> as the 'map_fields' method.
>>
>> The worker subprocesses divide up the token range among themselves and
>> proceed quasi-independently. Each worker opens a connection to each cluster,
>> and the driver sets up connection pools to the nodes in the cluster. Anyway,
>> there are a lot of processes, threads, and callbacks going at once, so it is
>> fun to watch.
>>
>> On my regional cluster of small nodes in AWS I got about 3000 rows per
>> second transferred after things warmed up a bit - each row about 6kb.
>>
>> ml
>>
>> On Wed, Jun 4, 2014 at 11:49 AM, Laing, Michael <michael.laing@nytimes.com> wrote:
>>
>>> OK Marcelo, I'll work on it today. -ml
>>>
>>> On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle <marcelo@s1mbi0se.com.br> wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> For sure I would be interested in this program!
>>>>
>>>> I am new both to python and to cql. I started creating this copier,
>>>> but was having problems with timeouts. Alex solved my problem here on the
>>>> list, but I think I will still have a lot of trouble making the copy
>>>> work well.
>>>>
>>>> I open sourced my version here:
>>>> https://github.com/s1mbi0se/cql_record_processor
>>>>
>>>> Just in case it's useful for anything.
>>>>
>>>> However, I saw CQL has support for concurrency itself, and having
>>>> something made by someone who knows the Python CQL driver better would be
>>>> very helpful.
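The token-range split Michael describes - each worker subprocess taking a contiguous slice of the full Murmur3 token range and paging through it independently - can be sketched roughly as follows. This is an illustration, not code from his gist; the function and variable names are mine.

```python
# Sketch: divide the full Murmur3 token range among worker subprocesses,
# as the fast-copy program is described as doing. Names are illustrative.

MIN_TOKEN = -2**63        # Murmur3Partitioner minimum token
MAX_TOKEN = 2**63 - 1     # Murmur3Partitioner maximum token

def split_token_range(worker_count, lo=MIN_TOKEN, hi=MAX_TOKEN):
    """Return a list of (start, end) token slices, one per worker.

    Slices are contiguous, non-overlapping, and together cover [lo, hi].
    """
    span = hi - lo + 1
    step = span // worker_count
    slices = []
    for i in range(worker_count):
        start = lo + i * step
        # The last worker absorbs any remainder from integer division.
        end = hi if i == worker_count - 1 else start + step - 1
        slices.append((start, end))
    return slices

# Each worker would then page through its slice with a query shaped like:
#   SELECT ... FROM ks.table
#   WHERE token(partition_key) >= %s AND token(partition_key) <= %s
```

With worker_count=2, one worker covers the negative half of the token ring and the other the non-negative half, which is why the workers can proceed quasi-independently without overlapping rows.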
>>>>
>>>> My two servers today are at OVH (ovh.com); we have servers at AWS, but
>>>> in several cases we prefer other hosts. Both servers have SSD and 64 GB
>>>> RAM. I could use the script as a benchmark for you if you want. Besides, we
>>>> have some bigger clusters; I could run it on those just to test the speed if
>>>> this is going to help.
>>>>
>>>> Regards,
>>>> Marcelo.
>>>>
>>>> 2014-06-03 11:40 GMT-03:00 Laing, Michael <michael.laing@nytimes.com>:
>>>>
>>>>> Hi Marcelo,
>>>>>
>>>>> I could create a fast copy program by repurposing some python apps
>>>>> that I am using for benchmarking the python driver - do you still need this?
>>>>>
>>>>> With high levels of concurrency and multiple subprocess workers, based
>>>>> on my current actual benchmarks, I think I can get well over 1,000
>>>>> rows/second on my mac and significantly more in AWS. I'm using variable
>>>>> size rows averaging 5kb.
>>>>>
>>>>> This would be the initial version of a piece of the benchmark suite we
>>>>> will release as part of our nyt⨍aбrik project on 21 June for my
>>>>> Cassandra Day NYC talk re the python driver.
>>>>>
>>>>> ml
>>>>>
>>>>> On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle <marcelo@s1mbi0se.com.br> wrote:
>>>>>
>>>>>> Hi Jens,
>>>>>>
>>>>>> Thanks for trying to help.
>>>>>>
>>>>>> Indeed, I know I can't do it using just CQL. But what would you use
>>>>>> to migrate the data manually? I tried to create a python program using auto
>>>>>> paging, but I am getting timeouts. I also tried Hive, but no success.
>>>>>> I only have two nodes and less than 200 GB in this cluster; any simple
>>>>>> way to extract the data quickly would be good enough for me.
>>>>>>
>>>>>> Best regards,
>>>>>> Marcelo.
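The auto-paging timeouts Marcelo mentions are the kind of failure the 'retry (downgrade consistency)' switch in Michael's program works around. A generic sketch of that idea - retry a page fetch with backoff, relaxing the consistency level one step per attempt - is below. This is purely illustrative; the names and the consistency ladder are my assumptions, not the driver's actual RetryPolicy API.

```python
import time

# Illustrative consistency ladder, strongest first. A real driver policy
# (e.g. a downgrading retry policy) handles this internally; here the
# downgrade is made explicit for clarity.
CONSISTENCY_LADDER = ["QUORUM", "TWO", "ONE"]

def fetch_with_retry(fetch_page, max_attempts=3, base_delay=0.0):
    """Call fetch_page(consistency_level) until it succeeds.

    Each retry backs off exponentially and downgrades consistency one
    step. Re-raises the last error if every attempt times out.
    """
    last_err = None
    for attempt in range(max_attempts):
        level = CONSISTENCY_LADDER[min(attempt, len(CONSISTENCY_LADDER) - 1)]
        try:
            return fetch_page(level)
        except TimeoutError as err:   # stand-in for a driver timeout error
            last_err = err
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise last_err
```

Downgrading trades read/write consistency for availability, so it is a reasonable default only for bulk-copy jobs like this one, where the source data is not changing underneath the copy.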
>>>>>>
>>>>>> 2014-06-02 15:08 GMT-03:00 Jens Rantil <jens.rantil@tink.se>:
>>>>>>
>>>>>>> Hi Marcelo,
>>>>>>>
>>>>>>> Looks like you can't do this without migrating your data manually:
>>>>>>> https://stackoverflow.com/questions/18421668/alter-cassandra-column-family-primary-key-using-cassandra-cli-or-cql
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jens
>>>>>>>
>>>>>>> On Mon, Jun 2, 2014 at 7:48 PM, Marcelo Elias Del Valle <marcelo@s1mbi0se.com.br> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have some cql CFs in a 2 node Cassandra 2.0.8 cluster.
>>>>>>>>
>>>>>>>> I realized I created my column family with the wrong partition key.
>>>>>>>> Instead of:
>>>>>>>>
>>>>>>>> CREATE TABLE IF NOT EXISTS entity_lookup (
>>>>>>>>   name varchar,
>>>>>>>>   value varchar,
>>>>>>>>   entity_id uuid,
>>>>>>>>   PRIMARY KEY ((name, value), entity_id))
>>>>>>>> WITH
>>>>>>>>   caching=all;
>>>>>>>>
>>>>>>>> I used:
>>>>>>>>
>>>>>>>> CREATE TABLE IF NOT EXISTS entitylookup (
>>>>>>>>   name varchar,
>>>>>>>>   value varchar,
>>>>>>>>   entity_id uuid,
>>>>>>>>   PRIMARY KEY (name, value, entity_id))
>>>>>>>> WITH
>>>>>>>>   caching=all;
>>>>>>>>
>>>>>>>> Now I need to migrate the data from the second CF to the first one.
>>>>>>>> I am using DataStax Community Edition.
>>>>>>>>
>>>>>>>> What would be the best way to convert data from one CF to the other?
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Marcelo.
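For Marcelo's case, both tables have identical columns and only the PRIMARY KEY layout differs - and the key layout lives entirely in the table definition, not in the statements - so the per-row mapping in a copy program is essentially an identity. Michael's program expects a SELECT and an UPDATE statement plus a 'map_fields' method; the sketch below uses an equivalent INSERT upsert, with statement and function names of my own, and leaves the actual driver session/execution out.

```python
# Sketch of the statement pair and row mapping for this migration.
# The columns match; only the PRIMARY KEY layout differs between the
# source (entitylookup) and destination (entity_lookup) tables.

SELECT_CQL = "SELECT name, value, entity_id FROM entitylookup"
INSERT_CQL = (
    "INSERT INTO entity_lookup (name, value, entity_id) "
    "VALUES (%(name)s, %(value)s, %(entity_id)s)"
)

def map_fields(row):
    """Map a source row to bind parameters for the destination INSERT.

    An identity mapping here, since the schemas share all columns; with
    a real schema change, this is where values would be reshaped.
    """
    return {"name": row["name"], "value": row["value"],
            "entity_id": row["entity_id"]}

# With a live session one would page through SELECT_CQL and, for each
# row, issue execute_async(INSERT_CQL, map_fields(row)).
```

Note that in CQL an INSERT is an upsert, so rerunning the copy after a partial failure simply overwrites already-copied rows rather than erroring.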