hbase-user mailing list archives

From Jonathan Gray <jg...@facebook.com>
Subject RE: Efficient mass deletes
Date Fri, 02 Apr 2010 16:26:00 GMT
Juhani,

Deletes are really special versions of Puts (so they are equally fast).  I suppose it would
be possible to have some kind of special filter that issued deletes server-side, but that
seems dangerous :)  That's beyond even the notion of stateful scanners, which are tricky as is.
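
To illustrate the "deletes are special Puts" point, here is a toy in-memory model, not HBase code. It assumes a delete is stored as one more timestamped entry (a tombstone marker) that masks everything at or before its timestamp until a compaction pass drops it. The class name, the marker value, and the `compact` method are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only (assumption, not the HBase implementation): a delete is
// written like a put, as a timestamped tombstone entry.
final class TombstoneSketch {
    private static final String TOMBSTONE = "\u0000TOMBSTONE"; // hypothetical marker

    // row -> (timestamp -> value or tombstone)
    private final Map<String, TreeMap<Long, String>> rows = new TreeMap<>();

    void put(String row, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(ts, value);
    }

    // A delete is just another write, so it costs the same as a put here.
    void delete(String row, long ts) {
        put(row, ts, TOMBSTONE);
    }

    // Returns the newest value, or null if the newest entry is a tombstone.
    String get(String row) {
        TreeMap<Long, String> versions = rows.get(row);
        if (versions == null || versions.isEmpty()) return null;
        String newest = versions.lastEntry().getValue();
        return TOMBSTONE.equals(newest) ? null : newest;
    }

    int storedEntries(String row) {
        TreeMap<Long, String> versions = rows.get(row);
        return versions == null ? 0 : versions.size();
    }

    // Drops each row's newest tombstone together with everything it masks.
    void compact() {
        for (TreeMap<Long, String> versions : rows.values()) {
            Long newestTombstone = null;
            for (Map.Entry<Long, String> e : versions.entrySet()) {
                if (TOMBSTONE.equals(e.getValue())) newestTombstone = e.getKey();
            }
            if (newestTombstone != null) versions.headMap(newestTombstone, true).clear();
        }
    }
}
```

In this model the space is only reclaimed by `compact()`, which is why a mass delete adds data before it removes any.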

MultiDelete would actually process those deletes in parallel, concurrently running across
all the servers, so it is a bit more than just List<Delete> under the covers.  Or at least
that's the intention; I don't think it's built yet.

Are you running into performance issues doing the deletes currently, or are you just expecting
to run into problems?  I would think that if it was taking too long to run from a sequential
client, a parallel MultiDelete would solve your problems.
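
As a rough client-side approximation of the parallel approach described above, the sketch below splits a list of row keys into batches and submits them to a thread pool. `BatchDeleter` is a hypothetical stand-in for whatever client call actually issues a batch of deletes, and batching here is by position in the list, not by region server as a real MultiDelete would presumably do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelDeleteSketch {
    // Hypothetical interface: wraps the real "delete this batch" client call.
    interface BatchDeleter {
        void deleteBatch(List<String> rowKeys) throws Exception;
    }

    // Splits keys into consecutive batches of at most batchSize elements.
    static List<List<String>> partition(List<String> keys, int batchSize) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            int end = Math.min(i + batchSize, keys.size());
            batches.add(new ArrayList<>(keys.subList(i, end)));
        }
        return batches;
    }

    // Runs the batches concurrently and returns how many keys were submitted
    // for deletion. Fails with the first batch error encountered.
    static long deleteInParallel(List<String> keys, int batchSize, int threads,
                                 BatchDeleter deleter) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (List<String> batch : partition(keys, batchSize)) {
                pending.add(pool.submit(() -> {
                    deleter.deleteBatch(batch);
                    return batch.size();
                }));
            }
            long total = 0;
            for (Future<Integer> f : pending) total += f.get(); // waits for each batch
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

The batch size and thread count are tuning knobs; measuring the sequential client first, as suggested above, tells you whether the extra complexity is needed at all.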

JG

> -----Original Message-----
> From: Juhani Connolly [mailto:juhani@ninja.co.jp]
> Sent: Thursday, April 01, 2010 10:44 PM
> To: hbase-user@hadoop.apache.org
> Subject: Efficient mass deletes
> 
> Having an issue with table design regarding how to delete old/obsolete
> data.
> 
> My row names are not sorted purely by time: the id comes first, followed
> by the timestamp, the main objective being to run big scans on specific
> ids from time x to time y.
> 
> However this data builds up at a respectable rate and I need a method to
> delete old records en masse. I considered using the ttl parameter on the
> column families, but the current plan is to selectively store data for a
> longer time for specific id's.
> 
> Are there any plans to link a delete operation with a scanner (so delete
> range x-y, or if you supply a filter, delete when conditions p and q are
> met)?
> 
> If not, what would be the recommended method to handle this kind of
> batch delete?
> The current JIRA for MultiDelete (
> http://issues.apache.org/jira/browse/HBASE-1845 )  simply implements
> deleting on a List<Delete>, which still seems limited.
> 
> Is the only way to do this to run a scan, and then build a List from
> that to use with the multi call discussed in HBASE-1845? This feels very
> inefficient, but please correct me if I'm mistaken. Current activity
> estimate is about 10 million rows a day, generating about 300 million
> cells, which would need to be deleted on a regular basis (so 300 million
> cells every day or 2.1 billion once a week).
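
For reference, the key layout and volumes in the quoted message can be sketched as follows. This assumes, purely for illustration, that both the id and the timestamp are non-negative 64-bit values encoded big-endian, so that unsigned byte order matches (id, time) order; the original message does not say how the keys are actually encoded, and all names here are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class RowKeySketch {
    // Figures taken from the message: rows and cells written per day.
    static final long ROWS_PER_DAY = 10_000_000L;
    static final long CELLS_PER_DAY = 300_000_000L;

    // Assumed layout: 8-byte id followed by 8-byte timestamp, big-endian.
    static byte[] rowKey(long id, long timestamp) {
        return ByteBuffer.allocate(16).putLong(id).putLong(timestamp).array();
    }

    // True if key falls in [rowKey(id, from), rowKey(id, to)), i.e. the range
    // a scan (or a range delete) for one id between two times would cover.
    static boolean inRange(byte[] key, long id, long from, long to) {
        return Arrays.compareUnsigned(key, rowKey(id, from)) >= 0
            && Arrays.compareUnsigned(key, rowKey(id, to)) < 0;
    }

    static long cellsPerRow() {
        return CELLS_PER_DAY / ROWS_PER_DAY; // 300 million / 10 million
    }

    static long cellsPerWeek() {
        return CELLS_PER_DAY * 7; // seven days of writes
    }
}
```

With this layout, old data for one id is a contiguous key range, but old data across all ids is not, which is why a purely time-based range delete does not map onto a single scan.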
