incubator-cassandra-user mailing list archives

From Yang <teddyyyy...@gmail.com>
Subject Re: batch dump of data from cassandra?
Date Mon, 23 May 2011 15:57:59 GMT
thanks Sri

I am trying to make sure that Brisk, underneath, does a simple sequential
scan of the rows instead of doing foreach key (keys) { lookup(key) }.
Once that's confirmed, I can feel comfortable using Brisk for the import/export jobs.

yang

On Mon, May 23, 2011 at 8:50 AM, SriSatish Ambati
<srisatish@datastax.com> wrote:
> Adrian,
> +1
> Using Hive & Hadoop for the export-import of data from & to Cassandra is one
> of the original use cases we had in mind for Brisk. It also has the
> ability to parallelize the workload and finish rapidly.
> thanks,
> Sri
> On Sun, May 22, 2011 at 11:31 PM, Adrian Cockcroft
> <adrian.cockcroft@gmail.com> wrote:
>>
>> Hi Yang,
>>
>> You could also use Hadoop (i.e. Brisk), and run a MapReduce job or
>> Hive query to extract and summarize/renormalize the data into whatever
>> format you like.
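[As an illustration of the MapReduce approach: a Hadoop Streaming-style mapper/reducer pair in Python that summarizes exported rows. This is only a sketch; in practice Brisk reads Cassandra directly through its own input formats. The input shape (one JSON row per line) and the field names `user_id` and `amount` are assumptions for the example, not anything Brisk defines.]

```python
# Sketch of a map/reduce summarization over exported rows.
# Assumed input: one JSON object per line, e.g. {"user_id": "a", "amount": "2.5"}
import json
from collections import defaultdict

def map_line(line):
    """Map phase: emit one (key, value) pair per input row."""
    row = json.loads(line)
    return [(row["user_id"], float(row["amount"]))]

def reduce_pairs(pairs):
    """Reduce phase: sum the values for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

[With Hadoop Streaming, the same two functions would be wired to stdin/stdout; the point is that the summarization is expressed per-row and per-key, so Hadoop can parallelize it across the cluster.]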
>>
>> If you use sstable2json, you have to run it on every file on every node
>> and then deduplicate/merge all the output across machines, which is what MR
>> does for you anyway.
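[A sketch of the deduplicate/merge step being described. It assumes each node's dump parses to a dict mapping row key to a list of [column, value, timestamp] triples, roughly sstable2json's shape; treat the exact format as an assumption. Duplicate rows across SSTables are reconciled the way Cassandra itself does: per column, the highest timestamp wins.]

```python
# Merge several sstable2json-style dumps, resolving duplicate rows/columns
# by keeping the column version with the highest timestamp.
def merge_dumps(dumps):
    merged = {}  # row_key -> {column_name: (value, timestamp)}
    for dump in dumps:
        for row_key, columns in dump.items():
            row = merged.setdefault(row_key, {})
            for name, value, ts in columns:
                if name not in row or ts > row[name][1]:
                    row[name] = (value, ts)
    # Re-emit in the same [name, value, timestamp] triple form.
    return {key: [[n, v, t] for n, (v, t) in sorted(row.items())]
            for key, row in merged.items()}
```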
>>
>> Our data flow is to take backups of a production cluster, restore a
>> backup to a different cluster running Hadoop, then run our point in
>> time data extraction for ETL processing by the BI team. The
>> backup/restore gives a frozen-in-time (consistent to within a second or so)
>> cluster for extraction. Running live with Brisk means you are running
>> your extraction over a moving target.
>>
>> Adrian
>>
>> On Sun, May 22, 2011 at 11:14 PM, Yang <teddyyyy123@gmail.com> wrote:
>> > Thanks Jonathan.
>> >
>> > On Sun, May 22, 2011 at 9:56 PM, Jonathan Ellis <jbellis@gmail.com>
>> > wrote:
>> >> I'd modify SSTableExport.serializeRow (the sstable2json class) to
>> >> output to whatever system you are targeting.
>> >>
>> >> On Sun, May 22, 2011 at 11:19 PM, Yang <teddyyyy123@gmail.com> wrote:
>> >>> Let's say that periodically (daily) I need to dump out the contents of my
>> >>> Cassandra DB and import them into Oracle or some other custom data
>> >>> store.
>> >>> Is there a way to do it?
>> >>>
>> >>> I checked that you can do multiget(), but you probably can't pass the
>> >>> entire key domain into the API, because the entire db would be returned
>> >>> on a single Thrift call and would probably overflow the
>> >>> API. Plus, multiget underneath just sends out per-key lookups one by
>> >>> one, while I really do not care about which key corresponds to which
>> >>> result, a simple scraping of the underlying SSTable would
>> >>> be perfect, because I could utilize the file cache coherency as I read
>> >>> down the file.
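[For reference, the usual alternative to a multiget over the whole key domain is a paged range scan (get_range_slices in the Thrift API), resuming each page from the last key seen. A minimal sketch of the paging loop, with an in-memory dict standing in for the cluster; the helper names here are illustrative, not a real client API.]

```python
import bisect

def make_fetch_page(data):
    """Stand-in for a range-slice call: return up to `limit` rows with
    key >= start, in key order (start is inclusive, as in Cassandra)."""
    keys = sorted(data)
    def fetch_page(start, limit):
        i = bisect.bisect_left(keys, start)
        return [(k, data[k]) for k in keys[i:i + limit]]
    return fetch_page

def scan_all(fetch_page, page_size=100):
    """Dump every row by paging through the key range, instead of
    issuing one lookup per key."""
    rows = {}
    start = ""
    while True:
        page = fetch_page(start, page_size)
        for key, value in page:
            rows[key] = value  # first row of each page repeats the resume key
        if len(page) < page_size:
            return rows
        start = page[-1][0]  # resume from the last key seen (inclusive)
```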
>> >>>
>> >>>
>> >>> Thanks
>> >>> Yang
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Jonathan Ellis
>> >> Project Chair, Apache Cassandra
>> >> co-founder of DataStax, the source for professional Cassandra support
>> >> http://www.datastax.com
>> >>
>> >
>
>
>
> --
> SriSatish Ambati
> Director of Engineering, DataStax
> @srisatish
>
