incubator-cassandra-user mailing list archives

From Adrian Cockcroft <adrian.cockcr...@gmail.com>
Subject Re: batch dump of data from cassandra?
Date Mon, 23 May 2011 16:05:36 GMT
Three ways to do this.

The client app does a get for every row key, which means lots of small
network operations.

Brisk / Hive does a select(*), which is sent to each node to map; the
Hadoop network shuffle then merges the results.

Write your own code to merge all the SSTables across the cluster.

So I think that Brisk is going to be easier to implement but also
closer in efficiency to the way you want to do it.
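
To give a rough idea of what the "merge it yourself" option involves, here is a
minimal sketch in Python. It assumes each node's dump has already been parsed
into a dict of row key -> list of (name, value, timestamp) columns; that layout
is a simplification for illustration, not the exact sstable2json output, and
merge_dumps is a hypothetical helper. It keeps the newest timestamp per column
and ignores tombstones and TTLs, which a real merge would have to handle.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed, simplified layout (not the exact sstable2json format):
Column = Tuple[str, str, int]        # (column name, value, write timestamp)
Dump = Dict[str, List[Column]]       # row key -> columns found in one dump


def merge_dumps(dumps: Iterable[Dump]) -> Dump:
    """Merge per-node dumps, keeping the newest version of each column.

    Hypothetical helper: resolves duplicates by highest timestamp only,
    and does not handle tombstones (deletes) or expiring columns.
    """
    merged: Dict[str, Dict[str, Column]] = {}
    for dump in dumps:
        for row_key, columns in dump.items():
            row = merged.setdefault(row_key, {})
            for name, value, ts in columns:
                current = row.get(name)
                # Keep this column if it is new or strictly newer.
                if current is None or ts > current[2]:
                    row[name] = (name, value, ts)
    # Return columns sorted by name so the output is deterministic.
    return {key: sorted(cols.values()) for key, cols in merged.items()}


if __name__ == "__main__":
    node_a = {"k1": [("c1", "old", 1)], "k2": [("c1", "x", 5)]}
    node_b = {"k1": [("c1", "new", 2), ("c2", "y", 1)]}
    for key, cols in sorted(merge_dumps([node_a, node_b]).items()):
        print(key, cols)
```

Holding everything in memory like this only works for small data sets; across
a whole cluster you would need an external sort or a shuffle, which is the part
MapReduce already provides.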

Adrian

On Monday, May 23, 2011, Yang <teddyyyy123@gmail.com> wrote:
> thanks Sri
>
> I am trying to make sure that Brisk underneath does a simple scraping
> of the rows, instead of doing foreach key ( keys ) { lookup (key) }..
> after that, I can feel comfortable using Brisk for the import/export jobs
>
> yang
>
> On Mon, May 23, 2011 at 8:50 AM, SriSatish Ambati
> <srisatish@datastax.com> wrote:
>> Adrian,
>> +1
>> Using hive & hadoop for the export-import of data from & to Cassandra is
>> one of the original use cases we had in mind for Brisk. That also has the
>> ability to parallelize the workload and finish rapidly.
>> thanks,
>> Sri
>> On Sun, May 22, 2011 at 11:31 PM, Adrian Cockcroft
>> <adrian.cockcroft@gmail.com> wrote:
>>>
>>> Hi Yang,
>>>
>>> You could also use Hadoop (i.e. Brisk), and run a MapReduce job or
>>> Hive query to extract and summarize/renormalize the data into whatever
>>> format you like.
>>>
>>> If you use sstable2json, you have to run on every file on every node,
>>> deduplicate/merge all the output across machines, which is what MR
>>> does anyway.
>>>
>>> Our data flow is to take backups of a production cluster, restore a
>>> backup to a different cluster running Hadoop, then run our point in
>>> time data extraction for ETL processing by the BI team. The
>>> backup/restore gives a frozen in time (consistent to a second or so)
>>> cluster for extraction. Running live with Brisk means you are running
>>> your extraction over a moving target.
>>>
>>> Adrian
>>>
>>> On Sun, May 22, 2011 at 11:14 PM, Yang <teddyyyy123@gmail.com> wrote:
>>> > Thanks Jonathan.
>>> >
>>> > On Sun, May 22, 2011 at 9:56 PM, Jonathan Ellis <jbellis@gmail.com>
>>> > wrote:
>>> >> I'd modify SSTableExport.serializeRow (the sstable2json class) to
>>> >> output to whatever system you are targeting.
>>> >>
>>> >> On Sun, May 22, 2011 at 11:19 PM, Yang <teddyyyy123@gmail.com> wrote:
>>> >>> let's say periodically (daily) I need to dump out the contents of my
>>> >>> Cassandra DB, and do an import into Oracle, or some other custom data
>>> >>> stores,
>>> >>> is there a way to do it?
>>> >>>
>>> >>> I checked that you can do multi-get() but you probably can't pass the
>>> >>> entire key domain into the API, cuz the entire db would be returned on
>>> >>> a single thrift call, and probably overflow the
>>> >>> API? plus multi-get underneath just sends out per-key lookups one by
>>> >>> one, while I really do not care about which key corresponds to which
>>> >>> result, a simple scraping of the underlying SSTable would
>>> >>> be perfect, because I could utilize the file cache coherency as I read
>>> >>> down the file.
>>> >>>
>>> >>>
>>> >>> Thanks
>>> >>> Yang
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Jonathan Ellis
>>> >> Project Chair, Apache Cassandra
>>> >> co-founder of DataStax, the source for professional Cassandra support
>>> >> http://www.datastax.com
>>> >>
>>> >
>>
>>
>>
>> --
>> SriSatish Ambati
>> Director of Engineering, DataStax
>> @srisatish
>>
>>
>>
>>
>>
>
