cassandra-user mailing list archives

From Paul Loy <ketera...@gmail.com>
Subject Re: Suggested settings for number crunching
Date Thu, 18 Aug 2011 15:37:49 GMT
Yup, we do that. We currently have 200 threads that push mutations into a
pool of Mutators (think Pelops, although that was too slow so we rolled our
own much lower-level version). We have around 50 thrift clients through
which the mutations are then pushed to Cassandra.
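A minimal sketch of that fan-out pattern, in Python for brevity (the real system is Java talking Thrift; every name here is hypothetical, and `flush` stands in for the actual batch_mutate call):

```python
# Sketch of the pattern described above: many producer threads feed
# mutations into a bounded queue, and a smaller pool of "client" workers
# drains it in batches. All names are hypothetical; the real system sends
# each batch over Thrift to Cassandra instead of collecting it in a list.
import queue
import threading

WORK_QUEUE = queue.Queue(maxsize=10_000)  # bounded so producers back off
BATCH_SIZE = 400
SENT_BATCHES = []
_LOCK = threading.Lock()

def producer(items):
    """Stands in for one of the 200 calculation threads."""
    for item in items:
        WORK_QUEUE.put(item)  # blocks if the clients fall behind

def client_worker():
    """Stands in for one of the ~50 thrift clients: drain and send batches."""
    batch = []
    while True:
        item = WORK_QUEUE.get()
        if item is None:  # poison pill: flush the remainder and exit
            break
        batch.append(item)
        if len(batch) >= BATCH_SIZE:
            flush(batch)
            batch = []
    if batch:
        flush(batch)

def flush(batch):
    # In the real system this would be a thrift batch_mutate call.
    with _LOCK:
        SENT_BATCHES.append(list(batch))

def run(n_items=2000, n_producers=4, n_clients=3):
    producers = [
        threading.Thread(target=producer, args=(range(i, n_items, n_producers),))
        for i in range(n_producers)
    ]
    clients = [threading.Thread(target=client_worker) for _ in range(n_clients)]
    for t in producers + clients:
        t.start()
    for t in producers:
        t.join()
    for _ in clients:
        WORK_QUEUE.put(None)  # one poison pill per client, after all items
    for t in clients:
        t.join()
    return SENT_BATCHES
```

The bounded queue is the important design choice: it applies back-pressure so producers stall instead of overrunning the clients.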

On Thu, Aug 18, 2011 at 4:35 PM, Jonathan Ellis <jbellis@gmail.com> wrote:

> Step 0: use multiple threads to insert
>
> On Thu, Aug 18, 2011 at 10:03 AM, Paul Loy <keteracel@gmail.com> wrote:
> > Yeah, we're processing item similarities. So we are writing single
> > columns at a time, although we do batch these into 400 mutations before
> > sending to Cassy. We currently perform almost 2 billion calculations that
> > then write almost 4 billion columns.
> >
> > Once all similarities are calculated, we just grab a slice per item and
> > create a denormalised vector of similar items (trimmed down to topN and
> > only those above a certain threshold). This makes lookup super fast as we
> > only get one column from cassandra.
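The trimming step described above (keep only scores above a threshold, cut to the top N) can be sketched like this; the function name, threshold, and N are illustrative, not the real code:

```python
# Sketch of the denormalisation step described above: from a slice of
# (item, similarity) pairs, keep only scores above a threshold and trim
# to the top N. Names and numbers here are illustrative placeholders.
import heapq

def similar_vector(slice_pairs, top_n=3, threshold=0.5):
    """Return the top_n most similar items with similarity > threshold."""
    kept = [(item, s) for item, s in slice_pairs if s > threshold]
    return heapq.nlargest(top_n, kept, key=lambda p: p[1])
```

Using `heapq.nlargest` avoids a full sort when the slice is much larger than N.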
> >
> > So we just want to optimise the crunching and storing phase, as that's an
> > O(n^2) problem. The quicker we can make that, the quicker the whole
> > process works.
> >
> > I'm going to try disabling minor compactions as a start.
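[On the Cassandra versions of that era, minor compactions could be switched off per column family by zeroing the compaction thresholds via nodetool, then restored afterwards; the keyspace and column family names below are placeholders:]

```shell
# Disable minor compaction for one column family by zeroing the
# min/max compaction thresholds (MyKeyspace/MyCF are placeholders).
nodetool -h localhost setcompactionthreshold MyKeyspace MyCF 0 0

# After the bulk load, restore the defaults and run a major compaction.
nodetool -h localhost setcompactionthreshold MyKeyspace MyCF 4 32
nodetool -h localhost compact MyKeyspace MyCF
```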
> >
> >> Is the loading disk, cpu, or network bound?
> >
> > CPU is at 40% free. There's only one cassy node, on the same box as the
> > processor for now, so no network traffic. So I think it's disk access.
> > Will find out for sure tomorrow after the current test runs.
> >
> > Thanks,
> >
> > Paul.
> >
> > On Thu, Aug 18, 2011 at 2:23 PM, Jake Luciani <jakers@gmail.com> wrote:
> >>
> >> Are you writing lots of tiny rows or a few very large rows? Are you
> >> batching mutations? Is the loading disk, cpu, or network bound?
> >> -Jake
> >> On Thu, Aug 18, 2011 at 7:08 AM, Paul Loy <keteracel@gmail.com> wrote:
> >>>
> >>> Hi All,
> >>>
> >>> I have a program that crunches through around 3 billion calculations.
> >>> We store the result of each of these in cassandra to later query once
> >>> in order to create some vectors. Our processing is limited by Cassandra
> >>> now, rather than the calculations themselves.
> >>>
> >>> I was wondering what settings I can change to increase the write
> >>> throughput. Perhaps disabling all caching, etc., as I won't be able to
> >>> keep it all in memory anyway and only want to query the results once.
> >>>
> >>> Any thoughts would be appreciated,
> >>>
> >>> Paul.
> >>>
> >>> --
> >>> ---------------------------------------------
> >>> Paul Loy
> >>> paul@keteracel.com
> >>> http://uk.linkedin.com/in/paulloy
> >>
> >>
> >>
> >> --
> >> http://twitter.com/tjake
> >
> >
> >
> > --
> > ---------------------------------------------
> > Paul Loy
> > paul@keteracel.com
> > http://uk.linkedin.com/in/paulloy
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>



-- 
---------------------------------------------
Paul Loy
paul@keteracel.com
http://uk.linkedin.com/in/paulloy
