cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Querying all keys in a column family
Date Sun, 26 Feb 2012 20:10:57 GMT
When you say "query 1 million records", in my mind I'm hearing "dump 1 million records to another system as a back-office job".
 
Hadoop will split the job over multiple nodes, assigning each task to read the range "owned" by one node. From memory it uses CL ONE (by default) for the read, so the node the task connects to is the only one involved in the read. The task can also be scheduled on that node rather than off-node, which keeps the reads local.
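
To give an idea of the moving parts, a minimal map-only job looks something like this (just a sketch, modelled on the word_count example that ships with Cassandra; the host, keyspace and CF names are made up):

    import java.nio.ByteBuffer;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ExportJob {

        // each map task sees only the rows in the token range of its split
        public static class ExportMapper
                extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>,
                               NullWritable, NullWritable> {
            @Override
            protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                               Context context) {
                // ship key / columns off to the external system here
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "export-window");
            job.setJarByClass(ExportJob.class);
            job.setMapperClass(ExportMapper.class);
            job.setNumReduceTasks(0);
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            job.setOutputFormatClass(NullOutputFormat.class);

            Configuration conf = job.getConfiguration();
            ConfigHelper.setInputInitialAddress(conf, "node1"); // any live node
            ConfigHelper.setInputRpcPort(conf, "9160");
            ConfigHelper.setInputPartitioner(conf,
                "org.apache.cassandra.dht.RandomPartitioner");
            ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyColumnFamily");

            // only pull the first few columns of each row
            SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0),
                               false, 10));
            ConfigHelper.setInputSlicePredicate(conf, predicate);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }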

This does not magic up new IO capacity, though. It only spreads the workload, so to add IO capacity, add nodes.

You could do something similar yourself by reading at a lower Consistency Level through the Thrift interface, and only asking each node for data in the token range it "owns".
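
Rolled by hand, the shape of that call is something like this (a sketch only: the host, keyspace, CF name and tokens are placeholders, and a real exporter would page through each range - describe_ring() tells you which ranges each node owns):

    import java.nio.ByteBuffer;
    import java.util.List;

    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class RangeDump {
        public static void main(String[] args) throws Exception {
            // connect directly to the node that owns the range we want
            TFramedTransport transport =
                new TFramedTransport(new TSocket("node1", 9160));
            transport.open();
            Cassandra.Client client =
                new Cassandra.Client(new TBinaryProtocol(transport));
            client.set_keyspace("MyKeyspace");

            // only pull the first few columns of each row
            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(new SliceRange(
                ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 10));

            // restrict the scan to the token range this node is an endpoint for
            KeyRange range = new KeyRange();
            range.setCount(1000); // page size
            range.setStart_token("0"); // placeholder tokens
            range.setEnd_token("85070591730234615865843651857942052864");

            List<KeySlice> page = client.get_range_slices(
                new ColumnParent("MyColumnFamily"), predicate, range,
                ConsistencyLevel.ONE);

            for (KeySlice row : page) {
                // ship row.getKey() / row.getColumns() to the external system
            }
            transport.close();
        }
    }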

If this does not help, the next step is to borrow the ideas in DataStax Brisk (now DataStax Enterprise): use NetworkTopologyStrategy with two data centres, or a virtual data centre (see http://wiki.apache.org/cassandra/HadoopSupport).

One DC serves OLTP and the other OLAP / export. The OLTP side can then run without interruption from the OLAP workload.
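
For what it's worth, the keyspace definition for that layout looks roughly like this in cassandra-cli (a sketch; the DC names must match what your snitch reports, and the replication factors are just examples):

    create keyspace Events
      with placement_strategy = 'NetworkTopologyStrategy'
      and strategy_options = {OLTP : 3, OLAP : 1};

OLTP clients then talk only to nodes in the OLTP DC (e.g. reading at LOCAL_QUORUM), while the Hadoop / export jobs point at the OLAP DC.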

Another option is to use something like Kafka to fork the data stream, sending it to Cassandra and the external system at the same time.
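
A sketch of what the fork could look like, using the Kafka Java producer client for illustration (the broker address, topic name and the CassandraWriter stand-in are all made up):

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ForkedWriter {
        // stand-in for whatever client you already use to write to Cassandra
        interface CassandraWriter { void insert(String rowKey, String payload); }

        private final CassandraWriter cassandra;
        private final Producer<String, String> producer;

        ForkedWriter(CassandraWriter cassandra) {
            this.cassandra = cassandra;
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka1:9092"); // hypothetical broker
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            this.producer = new KafkaProducer<>(props);
        }

        // every event goes to both sinks; the offline side drains the topic
        // on its own schedule without touching the OLTP cluster
        void write(String rowKey, String payload) {
            cassandra.insert(rowKey, payload);
            producer.send(new ProducerRecord<>("export-stream", rowKey, payload));
        }
    }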

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/02/2012, at 2:21 PM, Martin Arrowsmith wrote:

> Hi Alexandru,
> 
> Things got hectic and I put off the project until this weekend. I'm actually learning about Hadoop right now and how to implement it. I'll respond to this thread when I have something running.
> 
> In the meantime, I'd like to bump this email up and see if there are others who can provide some feedback. 1) Will Hadoop speed up the time to read all the rows? 2) Are there other options?
> 
> My guess was that Hadoop could split up your jobs, so each node could handle a portion of the query. For instance, having 2 nodes would do the job twice as fast. That is my naive guess though, and could be far from the truth.
> 
> Best wishes,
> 
> Martin
> 
> On Fri, Feb 24, 2012 at 5:29 AM, Alexandru Sicoe <adsicoe@gmail.com> wrote:
> Hi Aaron and Martin,
> 
> Sorry about my previous reply, I thought you wanted to process only the row keys in the CF.
> 
> I have a similar issue to Martin's, because I see myself being forced to hit more than a million rows with a query (I only get a few columns from every row). Aaron, we've talked about this in another thread: basically I am constrained to ship a window of data out of my online cluster to an offline cluster. For this I need to read, for example, a 5 min window of all the data I have. This simply accesses too many rows and I am hitting the I/O limit on the nodes. As I understand it, every row read will do 2 random disk seeks (I have no caches).
> 
> My question is: what can I do to improve the performance of shipping entire windows of data out?
> 
> Martin, did you use Hadoop as Aaron suggested? How did that work with Cassandra? I don't understand how accessing 1 million rows through map-reduce jobs would be any faster.
> 
> Cheers,
> Alexandru
> 
>  
> 
> On Tue, Feb 14, 2012 at 10:00 AM, aaron morton <aaron@thelastpickle.com> wrote:
> If you want to process 1 million rows, use Hadoop with Hive or Pig. But if you use Hadoop, you are not doing things in real time.
> 
> You may need to rephrase the problem. 
> 
> Cheers
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 14/02/2012, at 11:00 AM, Martin Arrowsmith wrote:
> 
>> Hi Experts,
>> 
>> My program is such that it queries all keys on Cassandra. I want to do this as quickly as possible, in order to get as close to real-time as possible.
>> 
>> One solution I heard was to use the sstable2json tool and read the data in as JSON. I understand that reading each row through Cassandra might take longer.
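>> From what I've read, the invocation is just something like "bin/sstable2json /var/lib/cassandra/data/MyKeyspace/MyCF-hc-1-Data.db" (hypothetical path), run once per SSTable.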
>> 
>> Are there any other ideas for doing this? Or can you confirm that sstable2json is the way to go?
>> 
>> Querying 100 rows in Cassandra the normal way is fast enough. I'd like to query a million rows, do some calculations on them, and spit out the result like it's real time.
>> 
>> Thanks for any help you can give,
>> 
>> Martin
> 
> 
> 

