From: Paulo Ricardo Motta Gomes
Date: Thu, 9 Oct 2014 12:38:20 -0300
Subject: Re: efficiently generate complete database dump in text format
To: user@cassandra.apache.org

The best way to generate dumps from Cassandra is via the Hadoop
integration (or Spark). You can find more info here:

http://www.datastax.com/documentation/cassandra/2.1/cassandra/configuration/configHadoop.html
http://wiki.apache.org/cassandra/HadoopSupport

On Thu, Oct 9, 2014 at 4:19 AM, Gaurav Bhatnagar wrote:

> Hi,
> We have a Cassandra column family containing 320 million rows, and
> each row contains about 15 columns. We want to take a monthly dump of
> this single column family in text format.
>
> We are planning to take the following approach:
> 1. Take a snapshot of the Cassandra database using the nodetool
>    utility. We specify the -cf flag with the column family name so
>    that the snapshot contains data for a single column family only.
> 2. Back up this snapshot and move the backup to a separate physical
>    machine.
> 3. Use the "SSTable to JSON conversion" utility to convert all the
>    data files into JSON format.
>
> We have the following questions about this approach:
> a) The generated JSON records contain a "d" (IS_MARKED_FOR_DELETE)
>    flag. Can I safely ignore all such records?
> b) If I ignore all records marked with the "d" flag, can the JSON
>    files generated in step 3 still contain duplicate records, that
>    is, multiple entries for the same key?
>
> Is there a better approach to generating data dumps in text format?
>
> Regards,
> Gaurav

--
Paulo Motta

Chaordic | Platform
www.chaordic.com.br
+55 48 3232.3200
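
On questions (a) and (b) in the quoted message, here is a minimal sketch
of why the "d" records probably should not be dropped before the files
are merged. It assumes the converter emits a JSON list of objects shaped
like {"key": ..., "columns": [[name, value, timestamp, optional flag], ...]},
which is an assumption about the output layout and should be checked
against the Cassandra version in use. The same row key can appear in
several SSTables, and a "d" entry in one file can shadow an older live
value in another, so the sketch keeps the newest entry per (key, column)
first and discards tombstones only afterwards. The names merge_rows and
live_rows are hypothetical; row-level deletions, TTLs and counters are
not handled.

```python
import json


def _wins(new_ts, new_flag, old_ts, old_flag):
    """Return True if the new entry should replace the old one."""
    if new_ts != old_ts:
        return new_ts > old_ts
    # Assumption: on a timestamp tie, the deletion takes precedence.
    return new_flag == "d" and old_flag != "d"


def merge_rows(dumps):
    """Merge parsed dumps into {key: {column: (value, timestamp, flag)}}."""
    merged = {}
    for rows in dumps:
        for row in rows:
            cols = merged.setdefault(row["key"], {})
            for entry in row.get("columns", []):
                name, value, ts = entry[0], entry[1], entry[2]
                flag = entry[3] if len(entry) > 3 else None
                current = cols.get(name)
                if current is None or _wins(ts, flag, current[1], current[2]):
                    cols[name] = (value, ts, flag)
    return merged


def live_rows(merged):
    """Drop tombstoned columns, and rows left with no live columns."""
    out = {}
    for key, cols in merged.items():
        live = {name: value
                for name, (value, ts, flag) in cols.items()
                if flag != "d"}
        if live:
            out[key] = live
    return out


if __name__ == "__main__":
    # Two hypothetical dumps of the same column family, one per SSTable.
    sstable_a = json.loads(
        '[{"key": "k1", "columns": [["name", "alice", 100],'
        ' ["city", "paris", 100]]}]')
    sstable_b = json.loads(
        '[{"key": "k1", "columns": [["city", "54366a10", 200, "d"]]}]')
    print(live_rows(merge_rows([sstable_a, sstable_b])))
```

With the two sample dumps above, filtering out the "d" entry before
merging would leave "city" = "paris" in the dump even though it was
deleted later. Merging first keeps only "name" for key "k1". Holding
320 million rows in a single in-memory dict is unlikely to be practical,
which is one reason a Hadoop or Spark job may be the better fit.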