From: Justin Cameron <justin@instaclustr.com>
Date: Thu, 27 Apr 2017 05:28:17 +0000
Subject: Re: How can I efficiently export the content of my table to KAFKA
To: user@cassandra.apache.org

You can run multiple applications in parallel in Standalone mode - you just
need to configure Spark to allocate resources between your jobs the way you
want (by default it assigns all resources to the first application you run,
so they won't be freed up until it has finished). A minimal config sketch
follows at the end of this mail.

You can use Spark's web UI to check the resources that are available and
those allocated to each job. See
http://spark.apache.org/docs/latest/job-scheduling.html for more details.

On Thu, 27 Apr 2017 at 15:12 Tobias Eriksson <tobias.eriksson@qvantel.com> wrote:

> Well, I have been working some with Spark, and the biggest hurdle is that
> Spark does not allow me to run multiple jobs in parallel,
>
> i.e. at the point of starting the job that takes the table of "Individuals",
> I will have to wait until all that processing is done before I can start
> an additional one,
>
> so I will need to start various additional jobs on demand, where I get
> "Addresses", "Invoices", … and so on.
>
> I know I could increase the number of Workers/Executors and use Mesos for
> handling the scheduling and resource management, but we have so far not
> been able to get it dynamic/flexible enough.
>
> Although I admit that this could still be a way forward; we have not
> evaluated it 100% yet, so I have not completely given up on that thought.
>
> -Tobias
>
> From: Justin Cameron <justin@instaclustr.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Thursday, 27 April 2017 at 01:36
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: How can I efficiently export the content of my table to KAFKA
>
> You could probably save yourself a lot of hassle by just writing a Spark
> job that scans through the entire table, converts each row to JSON and
> dumps the output into a Kafka topic. It should be fairly straightforward
> to implement.
>
> Spark will manage the partitioning of "Producer" processes for you - no
> need for a "Coordinator" topic.
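For what it's worth, here is a rough, untested sketch of the kind of job
described above, assuming the DataStax Spark Cassandra Connector and the
standard Kafka Java producer are on the classpath (the keyspace, table,
topic and host names below are placeholders):

import java.util.Properties
import com.datastax.spark.connector._
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.{SparkConf, SparkContext}

object TableToKafka {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("individuals-to-kafka")
      .set("spark.cassandra.connection.host", "10.0.0.1") // placeholder
    val sc = new SparkContext(conf)

    // The connector already splits the scan by token range, so each Spark
    // partition covers a contiguous slice of the ring - the "job list" is
    // handled for you, with no coordinator topic required.
    sc.cassandraTable("my_keyspace", "individuals")
      .foreachPartition { rows =>
        // Build one producer per partition, on the executor itself
        // (producers are not serializable, so don't create them on the driver).
        val props = new Properties()
        props.put("bootstrap.servers", "kafka1:9092") // placeholder
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        rows.foreach { row =>
          // Quick-and-dirty JSON - swap in your own serialization here.
          val json = row.toMap
            .map { case (k, v) => "\"" + k + "\":\"" + String.valueOf(v) + "\"" }
            .mkString("{", ",", "}")
          producer.send(new ProducerRecord[String, String](
            "individuals-export", json))
        }
        producer.close() // flushes buffered records
      }
    sc.stop()
  }
}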
> On Thu, 27 Apr 2017 at 05:49 Tobias Eriksson <tobias.eriksson@qvantel.com> wrote:
>
> Hi,
>
> I would like to make a dump of the database, in JSON format, to KAFKA.
>
> The database contains lots of data - millions, and in some cases billions,
> of "rows".
>
> I will provide the customer with an export of the data, where they can
> read it off of a KAFKA topic.
>
> My thinking was to make it scalable, such that I distribute the token
> ranges of all available partition keys to a number of (N) processes
> (JSON-Producers).
>
> First I will have a process which reads through the available tokens and
> publishes them on a KAFKA "Coordinator" topic.
>
> Then I can create 1, 10, 20 or N processes that act as Producers to the
> real KAFKA topic and pick available tokens/partition keys off of the
> "Coordinator" topic, one by one, until all the "rows" have been processed.
>
> So the JSON-Producer will take e.g. a range of 1000 "rows", convert them
> into my own JSON format and post them to KAFKA; then take another 1000
> "rows", and another 1000 "rows", and so on, until it is done.
>
> I base my idea on how I believe the Apache Spark Connector accomplishes
> data locality, i.e. being aware of where tokens reside, and figured that
> since that is possible, it should be possible to create a job list in a
> KAFKA topic, have each Producer pick jobs from there, read the data from
> Cassandra based on the partition key (token), and then post the JSON on
> the export KAFKA topic.
>
> https://dzone.com/articles/data-locality-w-cassandra-how
>
> Would you consider this a good idea?
>
> Would there in fact be a better idea? If so, what would that be?
>
> -Tobias
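If you do end up going down the coordinator-topic route anyway, the seeding
side could look roughly like the following - again an untested sketch with
made-up host/topic names. It carves the Murmur3 token ring into fixed-size
slices and publishes one "job" message per slice; each JSON-Producer would
then consume the topic in a shared consumer group (give the topic at least
as many partitions as you want parallel consumers), so Kafka hands every
slice to exactly one producer:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object CoordinatorTopicSeeder {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka1:9092") // placeholder
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // The Murmur3Partitioner token space is [-2^63, 2^63 - 1]; carve it
    // into equal slices. 4096 is illustrative - size the slices so each
    // holds roughly the ~1000 rows you want per job.
    val steps = 4096
    val span = BigInt(2).pow(64) / steps
    var lo = BigInt(Long.MinValue)
    for (_ <- 0 until steps) {
      val hi = lo + span - 1
      producer.send(new ProducerRecord[String, String](
        "export-coordinator", s"$lo:$hi"))
      lo = hi + 1
    }
    producer.close()
  }
}

And re the resource allocation mentioned at the top of this mail: in
standalone mode the relevant knob is spark.cores.max (plus
spark.executor.memory). Capping each application below the cluster total is
what lets a second job start while the first is still running. A minimal
sketch, with illustrative values and a placeholder master URL:

import org.apache.spark.{SparkConf, SparkContext}

object CappedExportJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("addresses-to-kafka")
      .setMaster("spark://spark-master:7077") // placeholder
      // Take at most 8 cores and 4g per executor, leaving capacity free so
      // the "Addresses", "Invoices", ... jobs can run concurrently.
      .set("spark.cores.max", "8")
      .set("spark.executor.memory", "4g")
    val sc = new SparkContext(conf)
    // ... export job body goes here ...
    sc.stop()
  }
}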
--
Justin Cameron
Senior Software Engineer

This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
and Instaclustr Inc (USA).

This email and any attachments may contain confidential and legally
privileged information. If you are not the intended recipient, do not copy
or disclose its content, but please reply to this email immediately and
highlight the error to the sender and then immediately delete the message.