From: Edward Capriolo
Date: Mon, 3 Oct 2016 16:38:24 -0400
Subject: Re: An extremely fast cassandra table full scan utility
To: "user@cassandra.apache.org"
Reply-To: user@cassandra.apache.org
Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
archived-at: Mon, 03 Oct 2016 20:38:32 -0000

I undertook a similar effort a while ago.

https://issues.apache.org/jira/browse/CASSANDRA-7014

Other than the fact that it was closed with no comments, I can tell you
that my other efforts to embed things in Cassandra did not go swimmingly.
Although, at the time, ideas like Groovy UDFs were being rejected.

On Mon, Oct 3, 2016 at 4:22 PM, Bhuvan Rawal wrote:

> Hi Jonathan,
>
> If a full scan is a regular requirement, then setting up a Spark cluster
> in locality with the Cassandra nodes makes perfect sense. But supposing
> that it is a one-off requirement, say a weekly or a fortnightly task, a
> Spark cluster could be an added overhead, with additional capacity and
> resource planning as far as operations / maintenance is concerned.
>
> So this could be thought of as a simple substitute for a single-threaded
> scan, without the additional effort of setting up and maintaining
> another technology.
>
> Regards,
> Bhuvan
>
> On Tue, Oct 4, 2016 at 1:37 AM, siddharth verma
> <sidd.verma29.list@gmail.com> wrote:
>
>> Hi Jon,
>> It wasn't allowed.
>> Moreover, someone who isn't familiar with Spark, and might be new to
>> map, filter, reduce, etc. operations, could also use the utility for
>> some simple operations, assuming a sequential scan of the Cassandra
>> table.
>>
>> Regards
>> Siddharth Verma
>>
>> On Tue, Oct 4, 2016 at 1:32 AM, Jonathan Haddad wrote:
>>
>>> Couldn't set up as in couldn't get it working, or it's not allowed?
>>>
>>> On Mon, Oct 3, 2016 at 3:23 PM Siddharth Verma
>>> <verma.siddharth@snapdeal.com> wrote:
>>>
>>>> Hi Jon,
>>>> We couldn't set up a Spark cluster.
>>>>
>>>> For some use case, a Spark cluster was required, but for some reason
>>>> we couldn't create a Spark cluster. Hence, one may use this utility
>>>> to iterate through the entire table at very high speed.
>>>>
>>>> We had to find a workaround that would be faster than paging through
>>>> the result set.
>>>>
>>>> Regards
>>>>
>>>> Siddharth Verma
>>>> *Software Engineer I - CaMS*
>>>> *M*: +91 9013689856, *T*: 011 22791596 *EXT*: 14697
>>>> CA2125, 2nd Floor, ASF Centre-A, Jwala Mill Road,
>>>> Udyog Vihar Phase - IV, Gurgaon-122016, INDIA
>>>>
>>>> On Tue, Oct 4, 2016 at 12:41 AM, Jonathan Haddad wrote:
>>>>
>>>> It almost sounds like you're duplicating all the work of both Spark
>>>> and the connector. May I ask why you decided to not use the existing
>>>> tools?
>>>>
>>>> On Mon, Oct 3, 2016 at 2:21 PM siddharth verma
>>>> <sidd.verma29.list@gmail.com> wrote:
>>>>
>>>> Hi DuyHai,
>>>> Thanks for your reply.
>>>> A few more features are planned for the next one (if there is one),
>>>> like:
>>>> a custom policy keeping in mind the replication of a token range on
>>>> specific nodes,
>>>> fine-graining the token range (for more speedup),
>>>> and a few more.
>>>>
>>>> As for fine-graining a token range, I think that if one token range
>>>> is split further into, say, 2-3 parts divided among threads, this
>>>> would exploit the possible parallelism on a large scaled-out cluster.
>>>>
>>>> And the JIRA you mentioned, on streaming of requests, would be of
>>>> huge help with further splitting the range.
>>>>
>>>> Thanks once again for your valuable comments. :-)
>>>>
>>>> Regards,
>>>> Siddharth Verma