From: Marcel Steinbach
Subject: Re: Get all keys from the cluster
Date: Sat, 21 Jan 2012 12:53:38 +0100
To: user@cassandra.apache.org

Thanks for your suggestions, Eric!

> One of the application uses 1.5TB out of 1.8TB

I'm sorry, maybe that statement was slightly ambiguous. I meant to say that one application uses 1.5TB, while the others use 300GB, totaling 1.8TB of data. Our total disk capacity, however, is about 7TB, so we're still far from running out of disk space.

> Is there any way that you could do that lookup in reverse where you pull the records from your SQL database, figure out which keys aren't necessary, and then delete any unnecessary keys that may or may not exist in cassandra?

Unfortunately, that won't work, since the SQL DB only contains the keys that we want to _keep_ in Cassandra.

> If that's not a possibility, then what about creating the same Cassandra schema in a different keyspace and copying all the relevant records from the current keyspace to the new keyspace using the SQL database records as a basis for what is actually "relevant" within the new keyspace.

I like that idea. So instead of iterating over all Cassandra rows, I would iterate over the SQL DB, which would indeed save me a lot of IO. However, rows inserted into my CF while iterating over the SQL DB might not be copied into the new keyspace.
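In code, that SQL-driven copy might look something like this (a rough sketch with made-up names; plain dicts stand in for the SQL key set and the two column families, since the real calls depend on the client library):

```python
# Pass 1: iterate over the keys the SQL DB says we should KEEP,
# and copy only those rows into the fresh keyspace.
# Dicts stand in for the real stores here; on a real cluster these
# would be client-library calls (e.g. cf.get / cf.insert).

def copy_relevant_rows(sql_keys, old_cf, new_cf):
    """Copy the rows named by sql_keys from old_cf into new_cf.

    Returns the number of rows copied; keys missing from old_cf
    (e.g. already expired) are simply skipped.
    """
    copied = 0
    for key in sql_keys:
        row = old_cf.get(key)   # dict.get here; a CF read on a real cluster
        if row is not None:
            new_cf[key] = row   # a CF insert on a real cluster
            copied += 1
    return copied

# Example with in-memory stand-ins:
sql_keys = {"a", "c"}                                         # keys to keep
old_cf = {"a": {"col": 1}, "b": {"col": 2}, "c": {"col": 3}}  # current keyspace
new_cf = {}                                                   # fresh keyspace
copy_relevant_rows(sql_keys, old_cf, new_cf)
# new_cf now holds only rows "a" and "c"; "b" is never copied, so
# dropping the old keyspace afterwards discards it wholesale.
```

A second pass over rows inserted into the SQL DB after the first pass started would use the same function with just the newly selected keys.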
But maybe we could arrange to do that during low-demand hours to minimize the amount of new inserts, and additionally run the "copy" a second time with a select on newly inserted SQL rows. So we'll probably go with that.

Thanks again for your help!

Cheers
Marcel

On 21.01.2012, at 11:52, Eric Czech wrote:

> Is there any way that you could do that lookup in reverse where you pull the records from your SQL database, figure out which keys aren't necessary, and then delete any unnecessary keys that may or may not exist in Cassandra?
>
> If that's not a possibility, then what about creating the same Cassandra schema in a different keyspace and copying all the relevant records from the current keyspace to the new keyspace, using the SQL database records as a basis for what is actually "relevant" within the new keyspace. If you could perform that transfer, then you could just delete the old 1.5TB keyspace altogether, leaving only the data you need. If that sort of duplication would put you over the 1.8TB limit during the transfer, then maybe you could consider CF compression upfront.
>
> Short of that, I can tell from experience that these sorts of "left join" deletes from Cassandra to SQL really suck. We have had to resort to using Hadoop to do this, but since our Hadoop/Cassandra clusters are much larger than our single SQL instances, keeping all the Hadoop processes from basically "DDoS"ing our SQL servers, while still making the process faster than Thrift iterations over all the rows in Cassandra (via custom programs), hasn't been a convincing solution.
>
> I'd say that the first solution I proposed is definitely the best, but also the most unrealistic. If that's really not a possibility for you, then I'd seriously look at trying to make my second suggestion work, even if it means bringing up new hardware or increasing the capacity of existing resources.
> That second suggestion also has the added benefit of likely minimizing I/O, since it's the only solution that doesn't require reading or deleting any of the unnecessary data (beyond wholesale keyspace or CF deletions), assuming that the actually relevant portion of your data is significantly less than 1.5TB.
>
> I hope that helps!
>
> And in the future, you should really try to avoid letting your data size get beyond 40-50% of your actual on-disk capacity. Let me know if anyone in the community disagrees, but I'd say you're about 600GB past the point at which you have a lot of easy outs -- but I hope you find one anyway!
>
>
> On Sat, Jan 21, 2012 at 2:45 AM, Marcel Steinbach wrote:
> We're running an 8-node cluster with different CFs for different applications. One of the applications uses 1.5TB out of 1.8TB in total, but only because we started out without a deletion mechanism and implemented one later on. So there is probably a high amount of old data in there that we don't even use anymore.
>
> Now we want to delete that data. To know which rows we may delete, we have to look up a SQL database. If the key is not in there anymore, we may delete that row in Cassandra, too.
>
> This basically means we have to iterate over all the rows in that CF. This kind of begs for Hadoop, but that seems not to be an option currently. I tried.
>
> So we figured we could run over the SSTable files (maybe only the index), check the keys against the MySQL DB, and later run the deletes on the cluster. This way, we could iterate on each node in parallel.
>
> Does that sound reasonable? Any pros/cons, maybe a "killer" argument to use Hadoop for that?
>
> Cheers
> Marcel
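The per-node scan described in that last message could be sketched like this (hypothetical names throughout; an in-memory set stands in for the batched MySQL key lookup, and a plain list for the keys read from one node's SSTable index):

```python
def find_stale_keys(sstable_keys, keys_to_keep, batch_size=1000):
    """Collect keys present in the SSTable index but absent from SQL.

    In a real setup, keys_to_keep would be answered by a batched
    SELECT against MySQL rather than an in-memory set, and the
    result would feed batched deletes against the cluster.
    """
    stale = []
    batch = []
    for key in sstable_keys:
        batch.append(key)
        if len(batch) >= batch_size:
            # One round-trip to SQL per batch, keeping load bounded.
            stale.extend(k for k in batch if k not in keys_to_keep)
            batch = []
    # Flush the final partial batch.
    stale.extend(k for k in batch if k not in keys_to_keep)
    return stale

# Example with stand-ins:
index_keys = ["k1", "k2", "k3", "k4"]  # keys scanned from one node's index
keep = {"k2", "k4"}                    # keys the SQL DB wants kept
print(find_stale_keys(index_keys, keep))  # ['k1', 'k3'] -> delete candidates
```

Running one such scan per node in parallel is what makes this cheaper than a single client iterating the whole CF over Thrift.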
>
> --
> chors GmbH
> specialists in digital and direct marketing solutions
> Haid-und-Neu-Straße 7
> 76131 Karlsruhe, Germany
> www.chors.com
>
> Managing Directors: Dr. Volker Hatz, Markus Plattner
> Amtsgericht Montabaur, HRB 15029
>
> This e-mail is for the intended recipient only and may contain confidential or privileged information. If you have received this e-mail by mistake, please contact us immediately and completely delete it (and any attachments) and do not forward it or inform any other person of its contents. If you send us messages by e-mail, we take this as your authorization to correspond with you by e-mail. E-mail transmission cannot be guaranteed to be secure or error-free, as information could be intercepted, amended, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. Neither chors GmbH nor the sender accepts liability for any errors or omissions in the content of this message which arise as a result of its e-mail transmission. Please note that all e-mail communications to and from chors GmbH may be monitored.