Yeah, we're processing item similarities, so we're writing single columns at a time, although we do batch these into 400 mutations before sending them to Cassandra. We currently perform almost 2 billion calculations that then write almost 4 billion columns.
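The batching described above can be sketched client-side like this. This is a minimal, hypothetical illustration: MutationBatcher and flush_fn are not from our code, and flush_fn stands in for whatever the real client call is (e.g. a Thrift batch_mutate).

```python
# Hypothetical sketch: buffer single-column mutations and flush every
# 400, as described above. flush_fn is a stand-in for the real client
# call that ships one batch to Cassandra.

BATCH_SIZE = 400

class MutationBatcher:
    def __init__(self, flush_fn, batch_size=BATCH_SIZE):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffer = []

    def add(self, row_key, column, value):
        # One similarity result = one column write.
        self.buffer.append((row_key, column, value))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered as one batch; called once more at
        # the end of the run to drain the remainder.
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
```

The point of batching this way is amortising per-request overhead: one round trip per 400 columns instead of one per column.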
Once all similarities are calculated, we just grab a slice per item and create a denormalised vector of similar items (trimmed down to the topN, and only those above a certain threshold). This makes lookup super fast, as we only fetch one column from Cassandra.
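That trimming step amounts to a threshold filter plus a sort-and-truncate. A minimal sketch, assuming one item's slice arrives as a dict of neighbour -> score (the names build_similar_vector, TOP_N and THRESHOLD are illustrative, not from the actual code):

```python
# Hypothetical sketch of the denormalisation step: keep only scores
# above a threshold, sort descending by score, trim to topN. The
# resulting list is what gets serialised into the single column that
# lookups read.

TOP_N = 10        # illustrative value
THRESHOLD = 0.2   # illustrative value

def build_similar_vector(similarities, top_n=TOP_N, threshold=THRESHOLD):
    """similarities: dict mapping other_item -> similarity score."""
    kept = [(item, score) for item, score in similarities.items()
            if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]
```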
So we just want to optimise the crunching-and-storing phase, as that's an O(n^2) problem. The quicker we can make that, the quicker the whole process runs.
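For a back-of-envelope sense of why this is O(n^2): assuming symmetric pairwise similarity, n items give n(n-1)/2 pairs, and writing each score under both items doubles the columns (hence roughly 2x as many columns as calculations, matching the 2 billion / 4 billion figures above).

```python
# Back-of-envelope pair count for symmetric pairwise similarity.
def pair_count(n):
    return n * (n - 1) // 2

# Roughly 2e9 pairs would correspond to on the order of ~63k items,
# since n(n-1)/2 ~= 2e9 gives n ~= 63,246 (illustrative, not a figure
# from the thread).
```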
I'm going to try disabling minor compactions as a start.
> is the loading disk or cpu or network bound?
CPU is at 40% free.
Only one Cassandra node, on the same box as the processor for now, so no network traffic.
So I think it's disk access. I'll find out for sure tomorrow after the current test runs.
Are you writing lots of tiny rows or a few very large rows? Are you batching mutations? Is the loading disk, CPU, or network bound?

-Jake

On Thu, Aug 18, 2011 at 7:08 AM, Paul Loy <email@example.com> wrote:
I have a program that crunches through around 3 billion calculations. We store the result of each of these in Cassandra, to be queried once later in order to create some vectors. Our processing is now limited by Cassandra rather than by the calculations themselves.
I was wondering which settings I can change to increase write throughput, perhaps disabling all caching, etc., since I won't be able to keep it all in memory anyway and only want to query the results once.
Any thoughts would be appreciated,