incubator-cassandra-user mailing list archives

From Alexandru Sicoe <adsi...@gmail.com>
Subject Re: Querying all keys in a column family
Date Fri, 24 Feb 2012 13:29:44 GMT
Hi Aaron and Martin,

Sorry about my previous reply; I thought you only wanted to process the
row keys in the CF.

I have a similar issue to Martin's: I am forced to hit more than a million
rows with a single query (I only get a few columns from each row). Aaron,
we've talked about this in another thread: I am constrained to ship a
window of data out of my online cluster to an offline cluster. To do that I
need to read, for example, a 5-minute window of all the data I have. This
simply touches too many rows, and I am hitting the I/O limit on the nodes.
As I understand it, every row read costs 2 random disk seeks (I have no
caches).
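
To put a rough number on that, here is a back-of-the-envelope sketch in
Python. Only the 2 seeks per row and the 1 million rows come from what I
described above; the 10 ms per random seek and the node counts are
assumptions for illustration, not measurements from my cluster. It also
ignores replication factor and consistency level, and assumes each node
serves one seek at a time.

```python
def estimate_read_seconds(rows, seeks_per_row=2, seek_ms=10.0, nodes=1):
    """Rough wall-clock estimate for a seek-bound read.

    Assumes seeks are spread evenly across nodes and that each node
    serves one seek at a time (no caches, no concurrent disk I/O).
    seek_ms is an assumed figure for a spinning disk, not a measurement.
    """
    total_seeks = rows * seeks_per_row
    seeks_per_node = total_seeks / nodes
    return seeks_per_node * seek_ms / 1000.0


if __name__ == "__main__":
    # Hypothetical cluster sizes, just to show how the estimate scales.
    for nodes in (1, 3, 6):
        secs = estimate_read_seconds(1_000_000, nodes=nodes)
        print(f"{nodes} node(s): ~{secs:,.0f} s")
```

Under these assumptions a single node would spend about 20,000 seconds
seeking, far longer than the 5-minute window being read. So either the
number of seeks per row has to come down or the reads have to be spread
over more disks.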

My question is: what can I do to improve the performance of shipping
entire windows of data out?

Martin, did you use Hadoop as Aaron suggested? How did that work with
Cassandra? I don't understand how accessing 1 million rows through
map-reduce jobs would be any faster.

Cheers,
Alexandru


On Tue, Feb 14, 2012 at 10:00 AM, aaron morton <aaron@thelastpickle.com> wrote:

> If you want to process 1 million rows use Hadoop with Hive or Pig. If you
> use Hadoop you are not doing things in real time.
>
> You may need to rephrase the problem.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 14/02/2012, at 11:00 AM, Martin Arrowsmith wrote:
>
> Hi Experts,
>
> My program queries all keys in Cassandra. I want to do this as quickly
> as possible, in order to get as close to real-time as possible.
>
> One solution I heard was to use the sstable2json tool, and read the data
> in as JSON. I understand that reading row by row from Cassandra might
> take longer.
>
> Are there any other ideas for doing this? Or can you confirm that
> sstable2json is the way to go?
>
> Querying 100 rows in Cassandra the normal way is fast enough. I'd like to
> query a million rows, do some calculations on them, and spit out the result
> like it's real time.
>
> Thanks for any help you can give,
>
> Martin
>
>
>
