cassandra-user mailing list archives

From Marcel Steinbach <marcel.steinb...@chors.de>
Subject Re: Get all keys from the cluster
Date Sat, 21 Jan 2012 11:53:38 GMT
Thanks for your suggestions, Eric!

> One of the application uses 1.5TB out of 1.8TB

I'm sorry, maybe that statement was slightly ambiguous. I meant to say that one application
uses 1.5TB while the others use 300GB, totalling 1.8TB of data. Our total disk capacity,
however, is about 7TB, so we're still far from running out of disk space.

> Is there any way that you could do that lookup in reverse where you pull the records
from your SQL database, figure out which keys aren't necessary, and then delete any unnecessary
keys that may or may not exist in cassandra? 
Unfortunately, that won't work, since the SQL db only contains the keys that we want to
_keep_ in cassandra.
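To make the constraint concrete, here is a minimal sketch with plain Python sets standing in for the SQL table and the CF's key set (all names and data are hypothetical, not real Cassandra API calls): because the SQL db only lists the keys to keep, computing the deletable keys still requires enumerating every cassandra key first, which is exactly the full iteration we were hoping to avoid.

```python
# Sketch only: plain sets stand in for the SQL "keep" table and the
# set of row keys in the cassandra CF (hypothetical data).
sql_keys_to_keep = {"user:1", "user:2"}          # what the SQL db knows
cassandra_keys = {"user:1", "user:2", "user:3"}  # obtaining this needs a full CF scan

# The reverse lookup only tells us what to keep; the deletable keys are
# the difference, which we cannot compute without the full cassandra key set.
deletable = cassandra_keys - sql_keys_to_keep
```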

> If that's not a possibility, then what about creating the same Cassandra schema in a
different keyspace and copying all the relevant records from the current keyspace to the new
keyspace using the SQL database records as a basis for what is actually "relevant" within
the new keyspace.  
I like that idea. Instead of iterating over all cassandra rows, I would iterate over the
SQL DB, which would indeed save me a lot of IO. However, rows inserted into my CF while
iterating over the SQL DB might not be copied into the new keyspace. But maybe we could
arrange to do that during low-demand hours to minimize the number of new inserts, and
additionally run the "copy" a second time with a select on the newly inserted SQL rows.
So we'll probably go with that.
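The two-pass copy above can be sketched as follows. This is a minimal illustration with plain Python dicts standing in for the old and new keyspaces and sets standing in for the SQL key lists; all names and data are hypothetical, not pycassa/Cassandra calls.

```python
# Sketch of the two-pass copy (hypothetical stand-ins for keyspaces).
def copy_relevant(old_ks, new_ks, keep_keys):
    """Copy only rows whose key is listed in the SQL db."""
    for key in keep_keys:
        if key in old_ks:            # the row may have been deleted meanwhile
            new_ks[key] = old_ks[key]

old_ks = {"a": 1, "b": 2, "stale": 3}   # "stale" has no matching SQL key
new_ks = {}
keep = {"a", "b"}                        # keys from the SQL db

copy_relevant(old_ks, new_ks, keep)      # pass 1, during low-demand hours

# A row inserted while pass 1 ran is picked up by a second pass that
# selects only the SQL rows created after pass 1 started.
old_ks["c"] = 4
new_inserts = {"c"}
copy_relevant(old_ks, new_ks, new_inserts)   # pass 2
```

Once both passes complete, the old keyspace can be dropped wholesale, which is what makes this approach so cheap on IO.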

Thanks again for your help!

Cheers
Marcel

On 21.01.2012, at 11:52, Eric Czech wrote:

> Is there any way that you could do that lookup in reverse where you pull the records
from your SQL database, figure out which keys aren't necessary, and then delete any unnecessary
keys that may or may not exist in cassandra?  
> 
> If that's not a possibility, then what about creating the same Cassandra schema in a
different keyspace and copying all the relevant records from the current keyspace to the new
keyspace using the SQL database records as a basis for what is actually "relevant" within
the new keyspace.  If you could perform that transfer, then you could just delete the old
1.5TB keyspace altogether, leaving only the data you need.  If that sort of duplication would
put you over the 1.8TB limit during the transfer, then maybe you could consider CF compression
upfront.
> 
> Short of that, I can tell from experience that doing this sort of "left join" delete
from cassandra to SQL really sucks.  We have had to resort to using hadoop to do this, but since
our hadoop/cassandra clusters are much larger than our single SQL instances, keeping all the
hadoop processes from basically "DDoS"ing our SQL servers while still making the process faster
than thrift iterations over all the rows (via custom programs) in cassandra hasn't yielded a
convincing solution.
> 
> I'd say that the first solution I proposed is definitely the best, but also the most
unrealistic.  If that's really not a possibility for you, then I'd seriously look at trying
to make my second suggestion work, even if it means bringing up new hardware or increasing the
capacity of existing resources.  That second suggestion also has the added benefit of likely
minimizing I/O, since it's the only solution that doesn't require reading or deleting any of
the unnecessary data (beyond wholesale keyspace or CF deletions), assuming that the actually
relevant portion of your data is significantly less than 1.5TB.
> 
> I hope that helps!
> 
> And in the future, you should really try to avoid letting your data size get beyond
40-50% of your actual on-disk capacity.  Let me know if anyone in the community disagrees,
but I'd say you're about 600 GB past the point at which you have a lot of easy outs -- but
I hope you find one anyways!
> 
> 
> On Sat, Jan 21, 2012 at 2:45 AM, Marcel Steinbach <marcel.steinbach@chors.de> wrote:
> We're running an 8 node cluster with different CFs for different applications. One of
the applications uses 1.5TB out of 1.8TB in total, but only because we started out without a
deletion mechanism and implemented one later on. So there is probably a large amount of old
data in there that we don't even use anymore.
> 
> Now we want to delete that data. To know which rows we may delete, we have to look them up
in a SQL database. If a key is no longer in there, we may delete that row in cassandra, too.
> 
> This basically means we have to iterate over all the rows in that CF. This kind of begs
for hadoop, but that doesn't seem to be an option currently. I tried.
> 
> So we figured we could run over the sstable files (maybe only the index), check the
keys against the mysql db, and later run the deletes on the cluster. That way, we could
iterate on each node in parallel.
> 
> Does that sound reasonable? Any pros/cons, or maybe a "killer" argument for using hadoop
for that?
> 
> Cheers
> Marcel
> --
> chors GmbH
> specialists in digital and direct marketing solutions
> Haid-und-Neu-Straße 7
> 76131 Karlsruhe, Germany
> www.chors.com
> Managing Directors: Dr. Volker Hatz, Markus Plattner
> Amtsgericht Montabaur, HRB 15029
> 

