incubator-cassandra-user mailing list archives

From Jan Algermissen <>
Subject MapReduce response time and speed
Date Wed, 24 Jul 2013 14:33:15 GMT

I am Jan Algermissen (REST-head, freelance programmer/consultant) and Cassandra-newbie.

I am looking at Cassandra for an application I am working on. There will be a maximum of 10 million
items (texts and attributes of a retailer's products) in the database. There will be occasional
writes (e.g. price updates).

The use case for the application is to work on the whole data set, item by item, to produce
'exports'. It will be necessary to access the full set every time. There is no relationship
between the items. Processing is done iteratively.
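
To make the workload concrete, here is a minimal sketch in Python. All names in it (export_items, csv_row, the sample products) are made up for illustration and are not part of any Cassandra or Hadoop API. It only shows the shape of the job: one user-defined function applied to every item independently, with no step that combines items.

```python
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical item shape: a flat mapping of attribute name to text value.
Item = Dict[str, str]


def export_items(items: Iterable[Item], to_row: Callable[[Item], str]) -> Iterator[str]:
    """Apply a user-defined transformation to every item, one at a time.

    Items are independent of each other, so this is a pure "map" step
    with no "reduce" step that would need to see several items at once.
    """
    for item in items:
        yield to_row(item)


# Made-up sample data standing in for the product table.
products = [
    {"sku": "A1", "text": "Blue mug", "price": "4.90"},
    {"sku": "B2", "text": "Red mug", "price": "5.20"},
]


def csv_row(item: Item) -> str:
    """One possible 'export' definition a user might supply."""
    return ";".join([item["sku"], item["text"], item["price"]])


rows = list(export_items(products, csv_row))
```

Because no item depends on another, the same function could in principle run on separate slices of the data in parallel.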

My question: I am thinking that this is an ideal scenario for map-reduce but I am unsure about
two things:

Can a user of the system define new jobs in an ad-hoc fashion (like a query) or do map-reduce
jobs need to be prepared by a developer (e.g. in RIAK you need a developer to compile in the
job when you need the performance of Erlang-based jobs)?

Suppose a user indeed can specify a job and send it off to Cassandra for processing, what
is the expected response time?

Is it possible to reduce the response time (by tuning, adding more nodes) to make a result
available within a couple of minutes? Or will there most certainly be a gap of 10 minutes
or more?
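
As a rough back-of-the-envelope check, the sketch below computes the throughput such deadlines would imply for the full 10 million items. It assumes "a couple of minutes" means 2 minutes, and the per-node figure assumes perfectly linear scaling with no fixed job start-up cost, which is exactly the part I am unsure about. The function names are made up for illustration.

```python
TOTAL_ITEMS = 10_000_000  # maximum data set size stated above


def required_items_per_second(total_items: int, deadline_seconds: float) -> float:
    """Cluster-wide throughput needed to process every item before the deadline."""
    return total_items / deadline_seconds


def per_node_rate(total_items: int, deadline_seconds: float, nodes: int) -> float:
    """Throughput each node must sustain, ASSUMING work splits evenly across
    nodes and there is no fixed per-job overhead (an idealisation)."""
    return required_items_per_second(total_items, deadline_seconds) / nodes


two_minutes = required_items_per_second(TOTAL_ITEMS, 2 * 60)   # about 83,333 items/s
ten_minutes = required_items_per_second(TOTAL_ITEMS, 10 * 60)  # about 16,667 items/s

print(f"2 min deadline:  {two_minutes:,.0f} items/s cluster-wide")
print(f"10 min deadline: {ten_minutes:,.0f} items/s cluster-wide")
print(f"2 min deadline on 4 nodes: {per_node_rate(TOTAL_ITEMS, 120, 4):,.0f} items/s per node")
```

If there is a fixed start-up cost per job, it would have to be subtracted from the deadline before dividing, and adding nodes would not shrink it.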

I understand that map-reduce is not for ad-hoc 'querying', but my users expect the system
to feel quasi-interactive, because they intend to refine the processing job based on the results
they get. A short gap would be ok, but a definite gap on the order of 10+ minutes would not.

(For example, as far as I have learned, with RIAK you would most certainly have such a gap. How
about Cassandra? Throwing more nodes at the problem would be ok, I just need to understand
whether there is a definite 'response time penalty' I have to expect no matter what.)

