incubator-cassandra-user mailing list archives

From Paul Loy <ketera...@gmail.com>
Subject Re: Suggested settings for number crunching
Date Fri, 19 Aug 2011 09:14:15 GMT
Nice one thanks.

We're now up to 500k writes a second on one box, which is pretty good (well,
good enough until our data grows 5-fold). So disabling durable_writes may
speed us up some more!!

Cheers,

Paul.

On Thu, Aug 18, 2011 at 11:40 PM, aaron morton <aaron@thelastpickle.com> wrote:

> A couple of thoughts: 400 row mutations in a batch may be a bit high. More
> is not always better. Watch the TP stats to see if the mutation pool is
> backing up excessively.
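The batching pattern under discussion can be sketched client-agnostically: accumulate mutations into fixed-size lists and send each list as one batch call. This is a minimal stdlib-only illustration (the names are hypothetical, not any Cassandra client's API); the tuning question is just the batch size.

```python
from itertools import islice

def batched(mutations, batch_size):
    """Yield lists of at most batch_size items from an iterable.

    Illustrative sketch of the batching pattern: the caller would send
    each yielded list as a single batch_mutate-style call. Not tied to
    any real Cassandra client API.
    """
    it = iter(mutations)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# e.g. trying a smaller batch than 400 to see whether the mutation
# pool stops backing up:
batches = list(batched(range(1000), 100))
```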
>
> Also if you feel like having fun take a look at the durable_writes config
> setting for keyspaces, from the cli help…
> - durable_writes: When set to false all RowMutations on keyspace will
> by-pass CommitLog.
>   Set to true by default.
>
> This will remove disk access from the write path, which sounds OK in your
> case.
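For reference, the setting can be toggled from cassandra-cli; this is a sketch of the 0.8-era syntax ("Crunch" is a placeholder keyspace name, and the exact attribute syntax may vary by version):

```shell
# cassandra-cli (0.8-era syntax; "Crunch" is a placeholder keyspace).
# With durable_writes = false, row mutations bypass the commit log, so
# a crash loses anything not yet flushed from the memtables -- arguably
# acceptable here, since the whole crunch can simply be re-run.
update keyspace Crunch with durable_writes = false;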
>
> When you are doing the reads, the fastest slice predicate is one with no
> start, no finish, reversed = false
> (http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/). You can now
> reverse the storage order of comparators, so if you are getting cols from
> the end of the row consider changing the storage order.
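The point about reversed comparators can be shown with a toy model (plain Python, not Cassandra code): columns live in comparator order, and the cheap read is a forward slice from the front of the row. Reversing the comparator moves the end-of-row columns to the front, so the fast no-start/no-finish forward slice returns them.

```python
# Toy model of column storage order within a row (not Cassandra code).

def store(names, reversed_comparator=False):
    """Return column names in on-disk order under the comparator."""
    return sorted(names, reverse=reversed_comparator)

def forward_slice(stored, count):
    """The fast path: no start, no finish, reversed=False --
    i.e. read the first `count` columns off the front of the row."""
    return stored[:count]

names = ["c%02d" % i for i in range(10)]

# Natural comparator: highest-named columns sit at the END of the row,
# so grabbing them needs a reversed (slower) read.
natural = store(names)                        # c00 ... c09

# Reversed comparator: the same columns now sit at the FRONT, so the
# fast forward slice returns them directly.
rev = store(names, reversed_comparator=True)  # c09 ... c00
last_three = forward_slice(rev, 3)
```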
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 19/08/2011, at 3:43 AM, Paul Loy wrote:
>
> Yeah, the data after crunching drops to just 65,000 columns, so one
> Cassandra node is plenty. That will all go in memory on one box. It's only
> the crunching where we have lots of data and then need it arranged in a
> structured manner. That's why I don't use flat files that I just append to:
> I need them in order of similarity to generate the vectors.
>
> Bulk loading looks interesting.
>
> On Thu, Aug 18, 2011 at 4:21 PM, Jake Luciani <jakers@gmail.com> wrote:
>
>> So you only have 1 cassandra node?
>>
>> If you are interested only in getting the complete work done as fast as
>> possible before you begin reading, take a look at the new bulk loader in
>> cassandra:
>>
>> http://www.datastax.com/dev/blog/bulk-loading
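The rough shape of the bulk-load flow described in that blog post is: write SSTables offline, then stream them into the live cluster. This is only a sketch; the paths and keyspace/column-family names are placeholders, and flags vary by Cassandra version:

```shell
# 1. Write SSTables offline (SSTableSimpleUnsortedWriter, a Java API),
#    producing files like /tmp/bulk/Crunch/Similarities-*-Data.db
#    ("Crunch"/"Similarities" are placeholder names).
# 2. Stream them into the running cluster:
bin/sstableloader /tmp/bulk/Crunch
```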
>>
>> -Jake
>>
>>
>> On Thu, Aug 18, 2011 at 11:03 AM, Paul Loy <keteracel@gmail.com> wrote:
>>
>>> Yeah, we're processing item similarities. So we are writing single
>>> columns at a time. Although we do batch these into 400 mutations before
>>> sending to Cassy. We currently perform almost 2 billion calculations that
>>> then write almost 4 billion columns.
>>>
>>> Once all similarities are calculated, we just grab a slice per item and
>>> create a denormalised vector of similar items (trimmed down to topN and only
>>> those above a certain threshold). This makes lookup super fast as we only
>>> get one column from cassandra.
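That trimming step (threshold filter, then keep the top N by score) can be sketched in a few lines of stdlib Python; the function and variable names here are illustrative, not tied to any Cassandra client:

```python
import heapq

def similar_vector(similarities, top_n, threshold):
    """Trim a slice of (item, score) pairs down to the denormalised
    vector described above: only scores above the threshold, keeping
    the top_n highest. Illustrative sketch, not a real client call."""
    kept = [(item, s) for item, s in similarities if s > threshold]
    return heapq.nlargest(top_n, kept, key=lambda pair: pair[1])

sims = [("a", 0.9), ("b", 0.2), ("c", 0.75), ("d", 0.5), ("e", 0.65)]
vec = similar_vector(sims, top_n=3, threshold=0.4)
# vec -> [("a", 0.9), ("c", 0.75), ("e", 0.65)]
```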
>>>
>>> So we just want to optimise the crunching and storing phase, as that's an
>>> O(n^2) complexity problem. The quicker we can make that, the quicker the
>>> whole process works.
>>>
>>> I'm going to try disabling minor compactions as a start.
>>>
>>>
>>> > is the loading disk or cpu or network bound?
>>>
>>> CPU is at 40% free.
>>> Only one Cassy node on the same box as the processor for now, so no
>>> network traffic.
>>> So I think it's disk access. Will find out for sure tomorrow after the
>>> current test runs.
>>>
>>> Thanks,
>>>
>>> Paul.
>>>
>>>
>>> On Thu, Aug 18, 2011 at 2:23 PM, Jake Luciani <jakers@gmail.com> wrote:
>>>
>>>> Are you writing lots of tiny rows or a few very large rows, are you
>>>> batching mutations? is the loading disk or cpu or network bound?
>>>>
>>>> -Jake
>>>>
>>>> On Thu, Aug 18, 2011 at 7:08 AM, Paul Loy <keteracel@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I have a program that crunches through around 3 billion calculations.
>>>>> We store the result of each of these in cassandra to later query once
>>>>> in order to create some vectors. Our processing is limited by Cassandra
>>>>> now, rather than the calculations themselves.
>>>>>
>>>>> I was wondering what settings I can change to increase the write
>>>>> throughput. Perhaps disabling all caching, etc, as I won't be able to
>>>>> keep it all in memory anyway and only want to query the results once.
>>>>>
>>>>> Any thoughts would be appreciated,
>>>>>
>>>>> Paul.
>>>>>
>>>>> --
>>>>> ---------------------------------------------
>>>>> Paul Loy
>>>>> paul@keteracel.com
>>>>> http://uk.linkedin.com/in/paulloy
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> http://twitter.com/tjake
>>>>
>>>
>>>
>>>
>>> --
>>> ---------------------------------------------
>>> Paul Loy
>>> paul@keteracel.com
>>> http://uk.linkedin.com/in/paulloy
>>>
>>
>>
>>
>> --
>> http://twitter.com/tjake
>>
>
>
>
> --
> ---------------------------------------------
> Paul Loy
> paul@keteracel.com
> http://uk.linkedin.com/in/paulloy
>
>
>


-- 
---------------------------------------------
Paul Loy
paul@keteracel.com
http://uk.linkedin.com/in/paulloy
