incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Reading data in bulk from cassandra for indexing in Elastic search
Date Sun, 31 Mar 2013 09:46:36 GMT
> Approach 1:
> 1. Get chunks of 10,000 keys (which is configurable, but when I increase it to more than
> 15,000, I get a thrift frame size error from Cassandra; to fix it, I would need to increase
> that frame size via cassandra.yaml) and their columns (around 15 columns/key).
> 
You can model this on the way the Hadoop ColumnFamilyRecordReader works: run it in parallel
on every node in the cluster, and have each process read only the rows that fall in the primary
token range of the node it is running on. For the first range_slice query use the token range
for the node; for each subsequent query, convert the last row key returned into a token and use
that as the start token. 
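To make the "last row key to token" step concrete, here is a minimal sketch of deriving a RandomPartitioner token from a row key (the MD5 digest taken as a non-negative BigInteger, mirroring Cassandra's RandomPartitioner). The class and method names are my own; this assumes the cluster uses RandomPartitioner:

```java
import java.math.BigInteger;
import java.security.MessageDigest;

public class KeyToToken {
    // RandomPartitioner derives a token from the MD5 hash of the key,
    // interpreted as a signed BigInteger and then made non-negative.
    static BigInteger token(byte[] key) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] hash = md5.digest(key);
        return new BigInteger(hash).abs();
    }

    public static void main(String[] args) throws Exception {
        BigInteger t = token("some-row-key".getBytes("UTF-8"));
        // Use t.toString() as the start_token of the next KeyRange,
        // keeping the node's own end token as the end_token.
        System.out.println(t);
    }
}
```

After each page, feed the token of the last key you received back in as the start token and page again until the range is exhausted.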

IMHO 10K rows per slice is too many, I would start at 1K. More is not always better. 
 
> 1. What is the suggested strategy to read bulk data from Cassandra? Which read pattern
> is better: one big get range slice with 10,000 keys-columns, or multiple small GETs, one per
> key?
Somewhere there is a sweet spot. Big queries hurt overall query throughput on the nodes and
can lead to memory/GC issues on the client and servers. Lots of small queries result in more
time spent waiting for network latency. Start small and find the point where the overall throughput
stops improving, then make sure you are not hurting the throughput for other clients. 
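That search for the sweet spot can be sketched as a simple doubling probe. Everything below is illustrative, not from the thread: the class name, the toy cost model, and the 5% cut-off are assumptions, and in practice you would plug in timings from real range_slice calls against your cluster:

```java
import java.util.function.LongUnaryOperator;

public class BatchSizeProbe {
    // Doubles the batch size until throughput (rows/ms) stops improving
    // by at least `minGain` (e.g. 0.05 = 5%), then returns the last
    // size that was still a meaningful improvement.
    static int sweetSpot(LongUnaryOperator millisForBatch, int start, int max, double minGain) {
        int best = start;
        double bestRate = start / (double) millisForBatch.applyAsLong(start);
        for (int size = start * 2; size <= max; size *= 2) {
            double rate = size / (double) millisForBatch.applyAsLong(size);
            if (rate < bestRate * (1 + minGain)) break; // no meaningful gain: stop
            best = size;
            bestRate = rate;
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy cost model: fixed 50ms round trip + ~0.01ms per row.
        LongUnaryOperator cost = n -> 50 + n / 100;
        System.out.println(sweetSpot(cost, 100, 100_000, 0.05)); // prints 51200
    }
}
```

The toy model has no GC penalty, so it keeps rewarding bigger batches; a real cluster will flatten out (and eventually regress) much sooner, which is exactly the point of measuring rather than guessing.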
 
> 2. How about reading more values at once, say 50,000 keys-columns, by increasing the thrift
> frame size from 16MB to something greater like 54MB? How will it impact Cassandra's performance
> in general?
It will result in increased GC pressure.
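If you do raise the frame size anyway, the knobs live in cassandra.yaml (setting names as I recall them from the 1.x-era configs; double-check the yaml shipped with your version):

```yaml
# cassandra.yaml (1.x era; names assumed, verify against your version)
thrift_framed_transport_size_in_mb: 54
# must be set larger than the frame size
thrift_max_message_length_in_mb: 64
```

Every node and every Thrift client must agree on the frame size, and the larger buffers are part of where the extra GC pressure comes from.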

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 28/03/2013, at 1:44 PM, Utkarsh Sengar <utkarsh2012@gmail.com> wrote:

> Hello,
> 
> I am trying to implement an indexer for a column family in Cassandra (a cluster of 4 nodes)
> using Elasticsearch. I am writing a river plugin which retrieves data from Cassandra and
> pushes it to Elasticsearch. It is triggered once a day (which is configurable based on the
> requirement).
> 
> Total keys: ~50M
> 
> So for reading the whole column family (random partitioner), I am going ahead with this
> approach:
> As mentioned here, I use this example (PaginateGetRangeSlices.java):
> 
> Approach 1:
> 1. Get chunks of 10,000 keys (which is configurable, but when I increase it to more than
> 15,000, I get a thrift frame size error from Cassandra; to fix it, I would need to increase
> that frame size via cassandra.yaml) and their columns (around 15 columns/key).
> 2. Then send the 15,000 read records to Elasticsearch.
> 3. It is single-threaded for now. It will be hard to make this multithreaded because
> I would need to track the range of keys already read and share the start key value with
> every thread. Think the PaginateGetRangeSlices.java example, but multi-threaded.
> 
> I have implemented this approach; it's not that fast, taking about 6 hours to complete.
> 
> Approach 2:
> 1. Get all the keys using the same query as above, but retrieve only the key.
> 2. Divide the keys into x groups, where x is the number of threads I spawn. Every individual
> thread will do an individual GET per key and insert the result into Elasticsearch. This will
> considerably increase hits to Cassandra, but sounds more efficient.
> 
> 
> So my questions are:
> 1. What is the suggested strategy to read bulk data from Cassandra? Which read pattern
> is better: one big get range slice with 10,000 keys-columns, or multiple small GETs, one per
> key?
> 
> 2. How about reading more values at once, say 50,000 keys-columns, by increasing the thrift
> frame size from 16MB to something greater like 54MB? How will it impact Cassandra's performance
> in general?
> 
> I will appreciate your input on any other strategies you use to move bulk data from
> Cassandra.
> 
> -- 
> Thanks,
> -Utkarsh

