Subject: Re: batch dump of data from cassandra?
From: Adrian Cockcroft <adrian.cockcroft@gmail.com>
To: user@cassandra.apache.org
Date: Mon, 23 May 2011 09:05:36 -0700

Three ways to do this:

1. The client app does a get for every key: lots of small network
   operations.
2. Brisk/Hive does a select(*), which is sent to each node to map;
   the Hadoop network shuffle then merges the results.
3. Write your own code to merge all the SSTables across the cluster.

So I think Brisk is going to be easier to implement, while coming
close in efficiency to the way you want to do it.
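If you'd rather not go through Hive, the raw Hadoop equivalent of
option 2 is a small map-only job that scans the column family in
parallel on every node. A minimal, untested sketch, assuming the
0.7/0.8-era ColumnFamilyInputFormat/ConfigHelper Hadoop integration
(check the calls against your version); the keyspace, column family,
seed address, and output path are placeholders, and it assumes keys,
column names, and values decode as UTF-8:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class BulkDump {

    // One map() call per row; the input format hands each token range
    // to exactly one mapper, so every row is read once.
    static class DumpMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text> {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                           Context ctx) throws IOException, InterruptedException {
            for (IColumn col : columns.values()) {
                ctx.write(new Text(ByteBufferUtil.string(key)),
                          new Text(ByteBufferUtil.string(col.name()) + "\t"
                                   + ByteBufferUtil.string(col.value())));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cfdump");
        job.setJarByClass(BulkDump.class);
        job.setMapperClass(DumpMapper.class);
        job.setNumReduceTasks(0);         // map-only: a raw dump needs no shuffle
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/dumps/MyCF"));  // placeholder

        Configuration conf = job.getConfiguration();
        ConfigHelper.setRpcPort(conf, "9160");
        ConfigHelper.setInitialAddress(conf, "localhost");             // placeholder
        ConfigHelper.setPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyCF"); // placeholders
        // ask for every column of every row
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(conf, predicate);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With no reducers there is no shuffle at all: each mapper reads its
token range exactly once and writes its slice straight out, so you
don't have to de-dup replicas afterwards.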
Adrian

On Monday, May 23, 2011, Yang wrote:
> thanks Sri
>
> I am trying to make sure that Brisk underneath does a simple scraping
> of the rows, instead of doing foreach key ( keys ) { lookup (key) }..
> after that, I can feel comfortable using Brisk for the import/export jobs
>
> yang
>
> On Mon, May 23, 2011 at 8:50 AM, SriSatish Ambati wrote:
>> Adrian,
>> +1
>> Using Hive & Hadoop for the export-import of data from & to Cassandra
>> is one of the original use cases we had in mind for Brisk. That also
>> has the ability to parallelize the workload and finish rapidly.
>> thanks,
>> Sri
>> On Sun, May 22, 2011 at 11:31 PM, Adrian Cockcroft wrote:
>>>
>>> Hi Yang,
>>>
>>> You could also use Hadoop (i.e. Brisk), and run a MapReduce job or
>>> Hive query to extract and summarize/renormalize the data into
>>> whatever format you like.
>>>
>>> If you use sstable2json, you have to run it on every file on every
>>> node, then deduplicate/merge all the output across machines, which
>>> is what MR does anyway.
>>>
>>> Our data flow is to take backups of a production cluster, restore a
>>> backup to a different cluster running Hadoop, then run our
>>> point-in-time data extraction for ETL processing by the BI team. The
>>> backup/restore gives a frozen-in-time (consistent to a second or so)
>>> cluster for extraction. Running live with Brisk means you are running
>>> your extraction over a moving target.
>>>
>>> Adrian
>>>
>>> On Sun, May 22, 2011 at 11:14 PM, Yang wrote:
>>> > Thanks Jonathan.
>>> >
>>> > On Sun, May 22, 2011 at 9:56 PM, Jonathan Ellis wrote:
>>> >> I'd modify SSTableExport.serializeRow (the sstable2json class) to
>>> >> output to whatever system you are targeting.
>>> >>
>>> >> On Sun, May 22, 2011 at 11:19 PM, Yang wrote:
>>> >>> let's say periodically (daily) I need to dump out the contents
>>> >>> of my Cassandra DB, and do an import into Oracle, or some other
>>> >>> custom data stores,
>>> >>> is there a way to do it?
>>> >>>
>>> >>> I checked that you can do multi-get(), but you probably can't
>>> >>> pass the entire key domain into the API, cuz the entire db would
>>> >>> be returned on a single thrift call, and probably overflow the
>>> >>> API? plus multi-get underneath just sends out per-key lookups one
>>> >>> by one, while I really do not care about which key corresponds to
>>> >>> which result; a simple scraping of the underlying SSTable would
>>> >>> be perfect, because I could utilize the file cache coherency as I
>>> >>> read down the file.
>>> >>>
>>> >>> Thanks
>>> >>> Yang
>>> >>
>>> >> --
>>> >> Jonathan Ellis
>>> >> Project Chair, Apache Cassandra
>>> >> co-founder of DataStax, the source for professional Cassandra
>>> >> support
>>> >> http://www.datastax.com
>>
>> --
>> SriSatish Ambati
>> Director of Engineering, DataStax
>> @srisatish
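P.S. If you do go the sstable2json route Jonathan suggests earlier in
the thread, the change is pretty localized: SSTableExport walks every
row in an sstable and serializeRow decides what to emit, so you can
swap the JSON printing for writes to your target store. Here's a rough,
untested sketch of a JDBC-backed sink that a modified serializeRow
could call; the class name, the target table layout, and the assumption
that keys/names/values decode as UTF-8 are all made up for illustration,
and serializeRow's exact signature moves around between Cassandra
versions:

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

// Hypothetical sink that a modified serializeRow could call once per row
// instead of printing JSON. The target table (row_key, column_name,
// column_value) and the UTF-8 decoding are illustrative assumptions.
public class JdbcRowWriter {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private static final String INSERT_SQL =
        "INSERT INTO cassandra_dump (row_key, column_name, column_value) VALUES (?, ?, ?)";

    private final Connection conn;
    private final PreparedStatement stmt;

    public JdbcRowWriter(String jdbcUrl, String user, String pass) throws Exception {
        conn = DriverManager.getConnection(jdbcUrl, user, pass);
        conn.setAutoCommit(false);             // batch commits for throughput
        stmt = conn.prepareStatement(INSERT_SQL);
    }

    // key and columns are what serializeRow already has in hand while
    // it walks a row's columns.
    public void writeRow(ByteBuffer key, Map<ByteBuffer, ByteBuffer> columns)
            throws Exception {
        String rowKey = utf8(key);
        for (Map.Entry<ByteBuffer, ByteBuffer> col : columns.entrySet()) {
            stmt.setString(1, rowKey);
            stmt.setString(2, utf8(col.getKey()));
            stmt.setString(3, utf8(col.getValue()));
            stmt.addBatch();
        }
        stmt.executeBatch();
        conn.commit();
    }

    public void close() throws Exception {
        stmt.close();
        conn.close();
    }

    private static String utf8(ByteBuffer b) {
        return UTF8.decode(b.duplicate()).toString();
    }
}

You'd still have to run this over every sstable on every node and
de-dup replicas afterwards, which is exactly the part MapReduce
handles for you.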