cassandra-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: An extremely fast cassandra table full scan utility
Date Mon, 03 Oct 2016 20:38:24 GMT
I undertook a similar effort a while ago.

https://issues.apache.org/jira/browse/CASSANDRA-7014

Other than the fact that it was closed with no comments, I can tell you
that other efforts I made to embed things in Cassandra did not go
swimmingly, although at the time ideas like Groovy UDFs were being rejected.

On Mon, Oct 3, 2016 at 4:22 PM, Bhuvan Rawal <bhu1rawal@gmail.com> wrote:

> Hi Jonathan,
>
> If full scan is a regular requirement then setting up a spark cluster in
> locality with Cassandra nodes makes perfect sense. But supposing that it is
> a one off requirement, say a weekly or a fortnightly task, a spark cluster
> could be an added overhead with additional capacity, resource planning as
> far as operations / maintenance is concerned.
>
> So this could be thought of as a simple substitute for a single-threaded
> scan, without the additional effort of setting up and maintaining another
> technology.
>
> Regards,
> Bhuvan
>
> On Tue, Oct 4, 2016 at 1:37 AM, siddharth verma <
> sidd.verma29.list@gmail.com> wrote:
>
>> Hi Jon,
>> It wasn't allowed.
>> Moreover, someone who isn't familiar with Spark, and might be new to
>> map, filter, reduce, etc. operations, could also use the utility for some
>> simple operations, assuming a sequential scan of the Cassandra table.
>>
>> Regards
>> Siddharth Verma
>>
>> On Tue, Oct 4, 2016 at 1:32 AM, Jonathan Haddad <jon@jonhaddad.com>
>> wrote:
>>
>>> Couldn't set up as in couldn't get it working, or it's not allowed?
>>>
>>> On Mon, Oct 3, 2016 at 3:23 PM Siddharth Verma <
>>> verma.siddharth@snapdeal.com> wrote:
>>>
>>>> Hi Jon,
>>>> We couldn't set up a Spark cluster.
>>>>
>>>> For one use case a Spark cluster was required, but for some reason we
>>>> couldn't create one. Hence, one may use this utility to iterate
>>>> through the entire table at very high speed.
>>>>
>>>> We had to find a workaround that would be faster than paging on the
>>>> result set.
>>>>
>>>> Regards
>>>>
>>>> Siddharth Verma
>>>> *Software Engineer I - CaMS*
>>>> *M*: +91 9013689856, *T*: 011 22791596 *EXT*: 14697
>>>> CA2125, 2nd Floor, ASF Centre-A, Jwala Mill Road,
>>>> Udyog Vihar Phase - IV, Gurgaon-122016, INDIA
>>>>
>>>> On Tue, Oct 4, 2016 at 12:41 AM, Jonathan Haddad <jon@jonhaddad.com>
>>>> wrote:
>>>>
>>>> It almost sounds like you're duplicating all the work of both spark and
>>>> the connector. May I ask why you decided to not use the existing tools?
>>>>
>>>> On Mon, Oct 3, 2016 at 2:21 PM siddharth verma <
>>>> sidd.verma29.list@gmail.com> wrote:
>>>>
>>>> Hi DuyHai,
>>>> Thanks for your reply.
>>>> A few more features are planned for the next release (if there is one),
>>>> such as:
>>>> a custom policy that keeps in mind the replication of each token range
>>>> on specific nodes,
>>>> finer-grained token ranges (for more speedup),
>>>> and a few more.
>>>>
>>>> As for fine-graining a token range, I think that
>>>> if one token range is split further into, say, 2-3 parts divided among
>>>> threads, this would exploit the possible parallelism on a large,
>>>> scaled-out cluster.
>>>>
>>>> And the JIRA you mentioned, on streaming of requests, would be of
>>>> huge help with further splitting the range.
>>>>
>>>> Thanks once again for your valuable comments. :-)
>>>>
>>>> Regards,
>>>> Siddharth Verma
>>>>
>>>>
>>>>
>>
>
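
The idea raised in the quoted thread, splitting one token range into 2-3
parts and dividing the parts among threads, can be sketched roughly as
below. This is only an illustration, not the utility's actual code: the
function names (split_token_range, scan_in_parallel), the scan_range
callback, and the assumption that the cluster uses the default
Murmur3Partitioner token bounds are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: default Murmur3Partitioner, whose tokens span -2**63 .. 2**63 - 1.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_range(start, end, parts):
    """Split the range (start, end] into `parts` contiguous sub-ranges."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    if end <= start:
        raise ValueError("wrapping ranges are not handled in this sketch")
    width = end - start
    parts = min(parts, width)  # never produce an empty sub-range
    # Integer arithmetic keeps the bounds exact even for 64-bit tokens.
    bounds = [start + (width * i) // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def scan_in_parallel(ranges, scan_range, max_workers=4):
    """Call scan_range(lo, hi) for each sub-range on a thread pool.

    Results are returned in the same order as `ranges`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: scan_range(*r), ranges))


if __name__ == "__main__":
    sub_ranges = split_token_range(MIN_TOKEN, MAX_TOKEN, 3)
    # Stand-in for a real query: just report the width of each sub-range.
    widths = scan_in_parallel(sub_ranges, lambda lo, hi: hi - lo)
    for (lo, hi), w in zip(sub_ranges, widths):
        print(lo, hi, w)
```

In a real scan, scan_range would presumably run a query of the form
SELECT ... FROM ks.tbl WHERE token(pk) > ? AND token(pk) <= ? against each
sub-range (ks, tbl and pk being placeholder names), so that every partition
is read exactly once across all threads.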
