Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of
 paulo.motta@chaordicsystems.com designates 209.85.213.54 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAFSAEXQ6xwBZ0DDtFok0vrtztUTCTfv5BNZSgZR-g=T7+aok3Q@mail.gmail.com>
References: 
 <CAFSAEXQ6xwBZ0DDtFok0vrtztUTCTfv5BNZSgZR-g=T7+aok3Q@mail.gmail.com>
From: Paulo Ricardo Motta Gomes <paulo.motta@chaordicsystems.com>
Date: Thu, 9 Oct 2014 12:38:20 -0300
Message-ID: 
 <CAM+WaZiuz0LCGdtZT3QvAVLBe7OMn7mSVsghr=fHgr=oHCKFww@mail.gmail.com>
Subject: Re: efficiently generate complete database dump in text format
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=20cf303a32c90b666a0504ff3a74

--20cf303a32c90b666a0504ff3a74
Content-Type: text/plain; charset=UTF-8

The best way to generate dumps from Cassandra is via its Hadoop integration
(or Spark). You can find more info here:

http://www.datastax.com/documentation/cassandra/2.1/cassandra/configuration/configHadoop.html
http://wiki.apache.org/cassandra/HadoopSupport

On Thu, Oct 9, 2014 at 4:19 AM, Gaurav Bhatnagar <gbhatnagar@gmail.com>
wrote:

> Hi,
>    We have a Cassandra column family containing 320 million rows, and
> each row contains about 15 columns. We want to take a monthly dump of
> this single column family in text format.
>
> We are planning to take the following approach to implement this:
> 1. Take a snapshot of the Cassandra database using the nodetool utility,
>    specifying the -cf flag with the column family name so that the
>    snapshot contains only the data for that single column family.
> 2. Back up this snapshot and move the backup to a separate physical
>    machine.
> 3. Use the "SSTable to JSON conversion" utility to convert all the data
>    files into JSON format.
>
> We have the following questions about this approach:
> a) The generated JSON records contain a "d" (IS_MARKED_FOR_DELETE) flag.
>    Can I safely ignore all such records?
> b) If I ignore all records marked with the "d" flag, can the JSON files
>    generated in step 3 still contain duplicate records, i.e. multiple
>    entries for the same key?
>
> Is there any better approach to generate data dumps in text format?
>
> Regards,
> Gaurav
>
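
On questions (a) and (b): if you do go the SSTable-to-JSON route, note that
the same key (and the same column) can appear in several SSTables, and a "d"
entry in a newer SSTable may shadow a live value in an older one. So dropping
"d" records on their own is probably not enough; the files need to be merged
per column by write timestamp first. Below is a minimal sketch of that merge,
not a definitive implementation. It assumes each JSON file is a list of
{"key": ..., "columns": [...]} rows and that each column is
[name, value, timestamp] with an optional fourth "d" flag; please check this
against the output of your version. The function name merge_sstable_dumps is
made up for illustration, and row-level tombstones and TTL expiry are not
handled.

```python
import json


def merge_sstable_dumps(dumps):
    """Merge parsed SSTable JSON dumps into {row_key: {column_name: value}}.

    Assumed layout (verify against your tool's output): each dump is a list
    of {"key": ..., "columns": [...]} rows, and each column is
    [name, value, timestamp] plus an optional flag where "d" means deleted.
    """
    latest = {}  # (row_key, column_name) -> (timestamp, is_tombstone, value)
    for rows in dumps:
        for row in rows:
            for col in row.get("columns", []):
                name, value, ts = col[0], col[1], col[2]
                is_tombstone = len(col) > 3 and col[3] == "d"
                cell = (row["key"], name)
                current = latest.get(cell)
                # Highest timestamp wins; on a tie this sketch keeps the tombstone.
                if current is None or (ts, is_tombstone) > (current[0], current[1]):
                    latest[cell] = (ts, is_tombstone, value)

    merged = {}
    for (key, name), (ts, is_tombstone, value) in latest.items():
        # Only now is it safe to drop deleted columns.
        if not is_tombstone:
            merged.setdefault(key, {})[name] = value
    return merged


# Two hypothetical dumps of the same row taken from different SSTables.
older = json.loads('[{"key": "k1", "columns": [["name", "alice", 100], ["city", "rio", 100]]}]')
newer = json.loads('[{"key": "k1", "columns": [["city", "", 200, "d"], ["name", "bob", 150]]}]')
print(merge_sstable_dumps([older, newer]))  # {'k1': {'name': 'bob'}}
```

With 320 million rows this in-memory version will not scale as is; it only
shows the rule (latest timestamp per column, then discard tombstones), which
is the same reconciliation a Hadoop or Spark job reading through Cassandra
gets for free.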


-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br <http://www.chaordic.com.br/>*
+55 48 3232.3200


--20cf303a32c90b666a0504ff3a74--