incubator-cassandra-user mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: Suggested settings for number crunching
Date Thu, 18 Aug 2011 15:35:09 GMT
Step 0: use multiple threads to insert
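
A minimal sketch of what multi-threaded inserting could look like, assuming a hypothetical `send_batch` callable that writes one batch with whatever Cassandra client is in use (it is not a real client API and must be safe to call from several threads, for example with one connection per thread). The batch size of 400 comes from the thread below; the worker count of 8 is an arbitrary starting point to tune. A bounded queue keeps only a few batches in memory at a time.

```python
import queue
import threading
from itertools import islice

BATCH_SIZE = 400  # batch size mentioned in the thread


def batches(mutations, size=BATCH_SIZE):
    """Yield successive lists of at most `size` mutations."""
    it = iter(mutations)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def parallel_insert(mutations, send_batch, workers=8):
    """Send batches from `workers` threads; return the number of mutations sent.

    `send_batch` is a hypothetical, thread-safe callable supplied by the caller.
    """
    work = queue.Queue(maxsize=workers * 2)  # bounded: producer waits when full
    lock = threading.Lock()
    stats = {"sent": 0}
    errors = []

    def worker():
        while True:
            batch = work.get()
            if batch is None:  # sentinel: no more work
                return
            try:
                send_batch(batch)
                with lock:
                    stats["sent"] += len(batch)
            except Exception as exc:  # keep draining so the producer never blocks
                with lock:
                    errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for batch in batches(mutations):
        work.put(batch)
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    if errors:
        raise errors[0]
    return stats["sent"]
```

Usage would be along the lines of `parallel_insert(all_mutations, send_batch, workers=16)`; the right worker count depends on whether the node turns out to be disk or CPU bound.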

On Thu, Aug 18, 2011 at 10:03 AM, Paul Loy <keteracel@gmail.com> wrote:
> Yeah, we're processing item similarities. So we are writing single columns
> at a time. Although we do batch these into 400 mutations before sending to
> Cassy. We currently perform almost 2 billion calculations that then write
> almost 4 billion columns.
>
> Once all similarities are calculated, we just grab a slice per item and
> create a denormalised vector of similar items (trimmed down to topN and only
> those above a certain threshold). This makes lookup super fast as we only
> get one column from cassandra.
>
> So we just want to optimise the crunching and storing phase as that's an
> O(n^2) complexity problem. The quicker we can make that the quicker the
> whole process works.
>
> I'm going to try disabling minor compactions as a start.
>
>> is the loading disk or cpu or network bound?
>
> cpu is at 40% free
> only one cassy node on the same box as the processor for now so no network
> traffic
> so I think it's disk access. Will find out for sure tomorrow after the
> current test runs.
>
> Thanks,
>
> Paul.
>
> On Thu, Aug 18, 2011 at 2:23 PM, Jake Luciani <jakers@gmail.com> wrote:
>>
>> Are you writing lots of tiny rows or a few very large rows? Are you
>> batching mutations? Is the loading disk, CPU, or network bound?
>> -Jake
>> On Thu, Aug 18, 2011 at 7:08 AM, Paul Loy <keteracel@gmail.com> wrote:
>>>
>>> Hi All,
>>>
>>> I have a program that crunches through around 3 billion calculations. We
>>> store the result of each of these in cassandra to later query once in order
>>> to create some vectors. Our processing is limited by Cassandra now, rather
>>> than the calculations themselves.
>>>
>>> I was wondering what settings I can change to increase the write
>>> throughput. Perhaps disabling all caching, etc, as I won't be able to keep
>>> it all in memory anyway and only want to query the results once.
>>>
>>> Any thoughts would be appreciated,
>>>
>>> Paul.
>>>
>>> --
>>> ---------------------------------------------
>>> Paul Loy
>>> paul@keteracel.com
>>> http://uk.linkedin.com/in/paulloy
>>
>>
>>
>> --
>> http://twitter.com/tjake
>
>
>
> --
> ---------------------------------------------
> Paul Loy
> paul@keteracel.com
> http://uk.linkedin.com/in/paulloy
>
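
The trimming step Paul describes above (keep only similarities above a threshold, then the topN of those) might be sketched as follows. The function name, the dict-shaped slice, and the example scores are assumptions made for illustration, not taken from Paul's code.

```python
import heapq


def top_similar(similarities, n, threshold):
    """Return up to `n` (item, score) pairs with score above `threshold`,
    highest score first.

    `similarities` is assumed to be a dict mapping item id -> score,
    i.e. one item's slice read back from Cassandra.
    """
    kept = [(other, score) for other, score in similarities.items()
            if score > threshold]
    return heapq.nlargest(n, kept, key=lambda pair: pair[1])


# Example with made-up scores:
slice_for_item = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.8}
vector = top_similar(slice_for_item, n=2, threshold=0.5)
# vector == [("a", 0.9), ("d", 0.8)]
```

The resulting list is what would be serialised into the single denormalised column per item.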



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
