Subject: Re: batch dump of data from cassandra?
From: Adrian Cockcroft <adrian.cockcroft@gmail.com>
To: user@cassandra.apache.org
Date: Mon, 23 May 2011 09:05:36 -0700

Three ways to do this:

1. The client app does a get for every key: lots of small network
   operations.
2. Brisk/Hive does a select(*), which is sent to each node to map;
   the Hadoop network shuffle then merges the results.
3. Write your own code to merge all the SSTables across the cluster.

So I think Brisk is going to be easier to implement, while coming
close in efficiency to the way you want to do it.
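If you'd rather not go through Hive, the raw Hadoop equivalent of
option 2 is a small map-only job that scans the column family in
parallel on every node. A minimal, untested sketch, assuming the
0.7/0.8-era ColumnFamilyInputFormat/ConfigHelper Hadoop integration
(check the calls against your version); the keyspace, column family,
seed address, and output path are placeholders, and it assumes keys,
column names, and values decode as UTF-8:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class BulkDump {

    // One map() call per row; the input format hands each token range
    // to exactly one mapper, so every row is read once.
    static class DumpMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text> {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                           Context ctx) throws IOException, InterruptedException {
            for (IColumn col : columns.values()) {
                ctx.write(new Text(ByteBufferUtil.string(key)),
                          new Text(ByteBufferUtil.string(col.name()) + "\t"
                                   + ByteBufferUtil.string(col.value())));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "cfdump");
        job.setJarByClass(BulkDump.class);
        job.setMapperClass(DumpMapper.class);
        job.setNumReduceTasks(0);         // map-only: a raw dump needs no shuffle
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/dumps/MyCF"));  // placeholder

        Configuration conf = job.getConfiguration();
        ConfigHelper.setRpcPort(conf, "9160");
        ConfigHelper.setInitialAddress(conf, "localhost");             // placeholder
        ConfigHelper.setPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyCF"); // placeholders
        // ask for every column of every row
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(conf, predicate);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With no reducers there is no shuffle at all: each mapper reads its
token range exactly once and writes its slice straight out, so you
don't have to de-dup replicas afterwards.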
Adrian

On Monday, May 23, 2011, Yang wrote:
> thanks Sri
>
> I am trying to make sure that Brisk underneath does a simple scraping
> of the rows, instead of doing foreach key ( keys ) { lookup (key) }..
> after that, I can feel comfortable using Brisk for the import/export jobs
>
> yang
>
> On Mon, May 23, 2011 at 8:50 AM, SriSatish Ambati wrote:
>> Adrian,
>> +1
>> Using Hive & Hadoop for the export-import of data from & to Cassandra
>> is one of the original use cases we had in mind for Brisk. That also
>> has the ability to parallelize the workload and finish rapidly.
>> thanks,
>> Sri
>> On Sun, May 22, 2011 at 11:31 PM, Adrian Cockcroft wrote:
>>>
>>> Hi Yang,
>>>
>>> You could also use Hadoop (i.e. Brisk), and run a MapReduce job or
>>> Hive query to extract and summarize/renormalize the data into
>>> whatever format you like.
>>>
>>> If you use sstable2json, you have to run it on every file on every
>>> node, then deduplicate/merge all the output across machines, which
>>> is what MR does anyway.
>>>
>>> Our data flow is to take backups of a production cluster, restore a
>>> backup to a different cluster running Hadoop, then run our
>>> point-in-time data extraction for ETL processing by the BI team. The
>>> backup/restore gives a frozen-in-time (consistent to a second or so)
>>> cluster for extraction. Running live with Brisk means you are running
>>> your extraction over a moving target.
>>>
>>> Adrian
>>>
>>> On Sun, May 22, 2011 at 11:14 PM, Yang wrote:
>>> > Thanks Jonathan.
>>> >
>>> > On Sun, May 22, 2011 at 9:56 PM, Jonathan Ellis wrote:
>>> >> I'd modify SSTableExport.serializeRow (the sstable2json class) to
>>> >> output to whatever system you are targeting.
>>> >>
>>> >> On Sun, May 22, 2011 at 11:19 PM, Yang wrote:
>>> >>> let's say periodically (daily) I need to dump out the contents
>>> >>> of my Cassandra DB, and do an import into Oracle, or some other
>>> >>> custom data stores,
>>> >>> is there a way to do it?
>>> >>>
>>> >>> I checked that you can do multi-get(), but you probably can't
>>> >>> pass the entire key domain into the API, cuz the entire db would
>>> >>> be returned on a single thrift call, and probably overflow the
>>> >>> API? plus multi-get underneath just sends out per-key lookups one
>>> >>> by one, while I really do not care about which key corresponds to
>>> >>> which result; a simple scraping of the underlying SSTable would
>>> >>> be perfect, because I could utilize the file cache coherency as I
>>> >>> read down the file.
>>> >>>
>>> >>> Thanks
>>> >>> Yang
>>> >>
>>> >> --
>>> >> Jonathan Ellis
>>> >> Project Chair, Apache Cassandra
>>> >> co-founder of DataStax, the source for professional Cassandra
>>> >> support
>>> >> http://www.datastax.com
>>
>> --
>> SriSatish Ambati
>> Director of Engineering, DataStax
>> @srisatish
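P.S. If you do go the sstable2json route Jonathan suggests earlier in
the thread, the change is pretty localized: SSTableExport walks every
row in an sstable and serializeRow decides what to emit, so you can
swap the JSON printing for writes to your target store. Here's a rough,
untested sketch of a JDBC-backed sink that a modified serializeRow
could call; the class name, the target table layout, and the assumption
that keys/names/values decode as UTF-8 are all made up for illustration,
and serializeRow's exact signature moves around between Cassandra
versions:

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

// Hypothetical sink that a modified serializeRow could call once per row
// instead of printing JSON. The target table (row_key, column_name,
// column_value) and the UTF-8 decoding are illustrative assumptions.
public class JdbcRowWriter {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private static final String INSERT_SQL =
        "INSERT INTO cassandra_dump (row_key, column_name, column_value) VALUES (?, ?, ?)";

    private final Connection conn;
    private final PreparedStatement stmt;

    public JdbcRowWriter(String jdbcUrl, String user, String pass) throws Exception {
        conn = DriverManager.getConnection(jdbcUrl, user, pass);
        conn.setAutoCommit(false);             // batch commits for throughput
        stmt = conn.prepareStatement(INSERT_SQL);
    }

    // key and columns are what serializeRow already has in hand while
    // it walks a row's columns.
    public void writeRow(ByteBuffer key, Map<ByteBuffer, ByteBuffer> columns)
            throws Exception {
        String rowKey = utf8(key);
        for (Map.Entry<ByteBuffer, ByteBuffer> col : columns.entrySet()) {
            stmt.setString(1, rowKey);
            stmt.setString(2, utf8(col.getKey()));
            stmt.setString(3, utf8(col.getValue()));
            stmt.addBatch();
        }
        stmt.executeBatch();
        conn.commit();
    }

    public void close() throws Exception {
        stmt.close();
        conn.close();
    }

    private static String utf8(ByteBuffer b) {
        return UTF8.decode(b.duplicate()).toString();
    }
}

You'd still have to run this over every sstable on every node and
de-dup replicas afterwards, which is exactly the part MapReduce
handles for you.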