cassandra-user mailing list archives

From Martin Arrowsmith <>
Subject Re: Querying all keys in a column family
Date Sun, 26 Feb 2012 01:21:45 GMT
Hi Alexandru,

Things got hectic and I put off the project until this weekend. I'm
actually learning about Hadoop right now and how to implement it. I can
respond to this thread when I have something running.

In the meantime, I'd like to bump this email up and see if there are others
who can provide some feedback. 1) Will Hadoop speed up the time to read all
the rows? 2) Are there other options?

My guess was that Hadoop could split up the job so that each node handles
a portion of the query. For instance, 2 nodes would do the job twice as
fast. That is my naive guess though and could be far from the
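
To make the guess above concrete, here is a minimal sketch of the idea in plain Python. It does not use Hadoop or any Cassandra client: the in-memory `row_tokens` list stands in for rows, and `split_token_range`, `count_rows_in_split`, and `parallel_count` are hypothetical helper names. It assumes a RandomPartitioner-style token space of 0 to 2**127 and only shows how a full scan could be cut into disjoint ranges that workers process independently.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: a RandomPartitioner-style token space of [0, 2**127).
RING_SIZE = 2 ** 127


def split_token_range(num_splits):
    """Cut the token ring into num_splits contiguous, non-overlapping ranges."""
    step = RING_SIZE // num_splits
    bounds = [i * step for i in range(num_splits)] + [RING_SIZE]
    return list(zip(bounds[:-1], bounds[1:]))


def count_rows_in_split(split, row_tokens):
    """Stand-in for one map task: visit only the rows whose token is in this split."""
    start, end = split
    return sum(1 for token in row_tokens if start <= token < end)


def parallel_count(row_tokens, num_workers):
    """Run one task per split and combine the partial results (the reduce step)."""
    splits = split_token_range(num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = pool.map(
            lambda split: count_rows_in_split(split, row_tokens), splits
        )
        return sum(partial_counts)
```

Calling `parallel_count(tokens, 2)` gives each of the two workers half of the ring. Whether that halves the wall-clock time on a real cluster depends on where the data lives and on disk I/O, which this sketch does not model.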

Best wishes,


On Fri, Feb 24, 2012 at 5:29 AM, Alexandru Sicoe <> wrote:

> Hi Aaron and Martin,
> Sorry about my previous reply, I thought you only wanted to process the
> row keys in the CF.
> I have a similar issue as Martin because I see myself being forced to hit
> more than a million rows with a query (I only get a few columns from every
> row). Aaron, we've talked about this in another thread, basically I am
> constrained to ship out a window of data from my online cluster to an
> offline cluster. For this I need to read, for example, a 5-minute window
> of all the data I have. This simply accesses too many rows and I am
> hitting the I/O limit on the nodes. As I understand it, every row read
> costs 2 random disk seeks (I have no caches).
> My question is, what can I do to improve the performance of shipping
> windows of data entirely out?
> Martin, did you use Hadoop as Aaron suggested? How did that work with
> Cassandra? I don't understand how accessing 1 million rows through
> MapReduce jobs would be any faster.
> Cheers,
> Alexandru
> On Tue, Feb 14, 2012 at 10:00 AM, aaron morton <> wrote:
>> If you want to process 1 million rows use Hadoop with Hive or Pig. If you
>> use Hadoop you are not doing things in real time.
>> You may need to rephrase the problem.
>> Cheers
>>   -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> On 14/02/2012, at 11:00 AM, Martin Arrowsmith wrote:
>> Hi Experts,
>> My program queries all keys in Cassandra. I want to do this as quickly
>> as possible, in order to get as close to real-time as possible.
>> One solution I heard was to use the sstable2json tool, and read the data
>> in as JSON. I understand that reading row by row from Cassandra might
>> take longer.
>> Are there any other ideas for doing this? Or can you confirm that
>> sstable2json is the way to go?
>> Querying 100 rows in Cassandra the normal way is fast enough. I'd like to
>> query a million rows, do some calculations on them, and spit out the result
>> like it's real time.
>> Thanks for any help you can give,
>> Martin
