incubator-cassandra-user mailing list archives

From SriSatish Ambati <srisat...@datastax.com>
Subject Re: batch dump of data from cassandra?
Date Mon, 23 May 2011 15:50:33 GMT
Adrian,
+1
Using Hive and Hadoop to export data from and import data into Cassandra is
one of the original use cases we had in mind for Brisk. Brisk can also
parallelize the workload, so the job finishes rapidly.

thanks,
Sri

On Sun, May 22, 2011 at 11:31 PM, Adrian Cockcroft <
adrian.cockcroft@gmail.com> wrote:

> Hi Yang,
>
> You could also use Hadoop (i.e. Brisk), and run a MapReduce job or
> Hive query to extract and summarize/renormalize the data into whatever
> format you like.
>
> If you use sstable2json, you have to run it on every file on every node,
> then deduplicate/merge all the output across machines, which is what MR
> does anyway.
>
> Our data flow is to take backups of a production cluster, restore a
> backup to a different cluster running Hadoop, then run our
> point-in-time data extraction for ETL processing by the BI team. The
> backup/restore gives a frozen-in-time (consistent to a second or so)
> cluster for extraction. Running live with Brisk means you are running
> your extraction over a moving target.
>
> Adrian
>
> On Sun, May 22, 2011 at 11:14 PM, Yang <teddyyyy123@gmail.com> wrote:
> > Thanks Jonathan.
> >
> > On Sun, May 22, 2011 at 9:56 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
> >> I'd modify SSTableExport.serializeRow (the sstable2json class) to
> >> output to whatever system you are targeting.
> >>
> >> On Sun, May 22, 2011 at 11:19 PM, Yang <teddyyyy123@gmail.com> wrote:
> >>> Let's say that periodically (daily) I need to dump out the contents
> >>> of my Cassandra DB and do an import into Oracle, or some other custom
> >>> data store.
> >>> Is there a way to do it?
> >>>
> >>> I checked that you can do multi-get(), but you probably can't pass the
> >>> entire key domain into the API, because the entire DB would be returned
> >>> in a single Thrift call and would probably overflow the API. Also,
> >>> multi-get underneath just sends out per-key lookups one by one, while
> >>> I really do not care which key corresponds to which result. A simple
> >>> scrape of the underlying SSTable would be perfect, because I could
> >>> take advantage of the file cache as I read down the file.
> >>>
> >>>
> >>> Thanks
> >>> Yang
> >>>
> >>
> >>
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder of DataStax, the source for professional Cassandra support
> >> http://www.datastax.com
> >>
> >
>
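
The deduplicate/merge step that Adrian mentions for sstable2json output can be
sketched roughly as below. This is only an illustration under stated
assumptions, not the method anyone in the thread used: it assumes each dump has
already been parsed into a dict that maps a row key to a list of
[name, value, timestamp] column triples (the real sstable2json layout differs
between Cassandra versions), the function name merge_dumps is hypothetical, and
tombstones and TTLs are ignored. When the same column appears in several dumps,
the copy with the highest timestamp is kept.

```python
def merge_dumps(dumps):
    """Merge parsed dumps, keeping the newest copy of each column.

    Assumed (hypothetical) input shape for each dump:
        {row_key: [[column_name, value, timestamp], ...]}
    Returns {row_key: {column_name: (value, timestamp)}}.
    """
    merged = {}
    for dump in dumps:
        for row_key, columns in dump.items():
            row = merged.setdefault(row_key, {})
            for name, value, timestamp in columns:
                current = row.get(name)
                # Keep this copy only if it is newer than the one already seen.
                if current is None or timestamp > current[1]:
                    row[name] = (value, timestamp)
    return merged


# Made-up sample data standing in for dumps taken from two nodes.
node_a = {"user1": [["email", "old@example.com", 100], ["name", "Ann", 100]]}
node_b = {
    "user1": [["email", "new@example.com", 200]],
    "user2": [["name", "Bob", 150]],
}

merged = merge_dumps([node_a, node_b])
```

Holding everything in one in-memory dict only works for small data sets. For a
full cluster the same per-column comparison would run inside the reduce phase
of a MapReduce job, which is the point Adrian makes above.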



-- 
SriSatish Ambati
Director of Engineering, DataStax
@srisatish
