cassandra-user mailing list archives

From Jake Luciani <jak...@gmail.com>
Subject Re: Suggested settings for number crunching
Date Thu, 18 Aug 2011 15:21:30 GMT
So you only have 1 cassandra node?

If you are interested only in getting the complete work done as fast as
possible before you begin reading, take a look at the new bulk loader in
cassandra:

http://www.datastax.com/dev/blog/bulk-loading

-Jake

On Thu, Aug 18, 2011 at 11:03 AM, Paul Loy <keteracel@gmail.com> wrote:

> Yeah, we're processing item similarities. So we are writing single columns
> at a time. Although we do batch these into 400 mutations before sending to
> Cassy. We currently perform almost 2 billion calculations that then write
> almost 4 billion columns.
>
> Once all similarities are calculated, we just grab a slice per item and
> create a denormalised vector of similar items (trimmed down to topN and only
> those above a certain threshold). This makes lookup super fast as we only
> get one column from cassandra.
>
> So we just want to optimise the crunching and storing phase, as that's an
> O(n^2) complexity problem. The quicker we can make that, the quicker the
> whole process works.
>
> I'm going to try disabling minor compactions as a start.
>
>
> > is the loading disk or cpu or network bound?
>
> cpu is at 40% free
> only one cassy node on the same box as the processor for now so no network
> traffic
> so I think it's disk access. Will find out for sure tomorrow after the
> current test runs.
>
> Thanks,
>
> Paul.
>
>
> On Thu, Aug 18, 2011 at 2:23 PM, Jake Luciani <jakers@gmail.com> wrote:
>
>> Are you writing lots of tiny rows or a few very large rows? Are you
>> batching mutations? Is the loading disk, CPU, or network bound?
>>
>> -Jake
>>
>> On Thu, Aug 18, 2011 at 7:08 AM, Paul Loy <keteracel@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I have a program that crunches through around 3 billion calculations. We
>>> store the result of each of these in cassandra to later query once in order
>>> to create some vectors. Our processing is limited by Cassandra now, rather
>>> than the calculations themselves.
>>>
>>> I was wondering what settings I can change to increase the write
>>> throughput. Perhaps disabling all caching, etc, as I won't be able to keep
>>> it all in memory anyway and only want to query the results once.
>>>
>>> Any thoughts would be appreciated,
>>>
>>> Paul.
>>>
>>> --
>>> ---------------------------------------------
>>> Paul Loy
>>> paul@keteracel.com
>>> http://uk.linkedin.com/in/paulloy
>>>
>>
>>
>>
>> --
>> http://twitter.com/tjake
>>
>
>
>
> --
> ---------------------------------------------
> Paul Loy
> paul@keteracel.com
> http://uk.linkedin.com/in/paulloy
>
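
For reference, the batching described in the quoted message (400 mutations per
send) can be sketched roughly as below. This is only an illustration, not the
code Paul is running: the (row_key, column_name, value) mutation shape and the
send_batch callable are hypothetical stand-ins for whatever the client library
in use provides, and 400 is simply the batch size mentioned above.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical mutation shape (an assumption): (row_key, column_name, value).
Mutation = Tuple[str, str, float]


def batches(mutations: Iterable[Mutation], size: int = 400) -> Iterator[List[Mutation]]:
    """Yield consecutive lists of at most `size` mutations."""
    it = iter(mutations)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def write_all(
    mutations: Iterable[Mutation],
    send_batch: Callable[[List[Mutation]], None],
    size: int = 400,
) -> int:
    """Send mutations in batches; return the number of batches sent.

    `send_batch` is a placeholder for the real client call (an assumption).
    """
    sent = 0
    for chunk in batches(mutations, size):
        send_batch(chunk)
        sent += 1
    return sent
```

Because the input is consumed lazily, the full set of mutations never has to
sit in memory at once, which matters when there are billions of columns.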

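The denormalised vector step in the quoted message (take one item's slice,
keep only scores above a threshold, trim to topN) might look something like
this. Again a sketch under assumptions: the slice is modelled as a plain dict
of item id to similarity score, and the names top_similar, top_n and threshold
are made up for illustration.

```python
import heapq
from typing import Dict, List, Tuple


def top_similar(
    similarities: Dict[str, float], top_n: int, threshold: float
) -> List[Tuple[str, float]]:
    """Return up to `top_n` (item, score) pairs scoring above `threshold`,
    ordered from most to least similar."""
    above = (
        (item, score) for item, score in similarities.items() if score > threshold
    )
    # nlargest keeps only top_n candidates in memory while scanning the slice.
    return heapq.nlargest(top_n, above, key=lambda pair: pair[1])


# Example slice for one item (made-up scores).
row = {"b": 0.9, "c": 0.2, "d": 0.75, "e": 0.6, "f": 0.05}
vector = top_similar(row, top_n=2, threshold=0.5)
```

The resulting list is what would be stored as the single column that the
lookup path reads.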


-- 
http://twitter.com/tjake
