accumulo-user mailing list archives

From Keith Turner <ke...@deenlo.com>
Subject Re: Deleting rows from the Java API
Date Wed, 09 May 2012 15:13:00 GMT
On Wed, May 9, 2012 at 11:00 AM, Billie J Rinaldi
<billie.j.rinaldi@ugov.gov> wrote:
> On Wednesday, May 9, 2012 10:31:46 AM, "Sean Pines" <spines83@gmail.com> wrote:
>> I have a use case that involves me removing a record from Accumulo
>> based on the Row ID and the Column Family.
>>
>> In the shell, I noticed the command "deletemany" which allows you to
>> specify column family/column qualifier. Is there an equivalent of this
>> in the Java API?
>>
>> In the Java API, I noticed the method:
>> deleteRows(String tableName, org.apache.hadoop.io.Text start,
>> org.apache.hadoop.io.Text end)
>> Delete rows between (start, end]
>>
>> However that only seems to work for deleting a range of RowIDs
>>
>> I would also imagine that deleting rows is costly; is there a better
>> way to approach something like this?
>> The workaround I have for now is to just overwrite the row with an
>> empty string in the value field and ignore any entries that have that.
>> However this just leaves lingering rows for each "delete" and I'd like
>> to avoid that if at all possible.
>>
>> Thanks!
>
> Connector provides a createBatchDeleter method.  You can set the range and columns for
> BatchDeleter just like you would with a Scanner.  This is not an efficient operation (despite
> the current javadocs for BatchDeleter), but it works well if you're deleting a small number
> of entries.  It scans for the affected key/value pairs, pulls them back to the client, then
> inserts deletion entries for each.  The deleteRows method, on the other hand, is efficient
> because large ranges can just be dropped.  If you want to delete a lot of things and deleteRows
> won't work for you, consider using a majc scope Filter that filters out what you don't want,
> compact the table, then remove the filter.
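For the row id / column family case, something along these lines should work.  This is an
untested sketch against the 1.4 client API; the connector, table name, row id, and column
family are placeholders for your own values.

import org.apache.accumulo.core.client.BatchDeleter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

import java.util.Collections;

public class DeleteByRowAndFamily {
  public static void deleteEntries(Connector conn) throws Exception {
    // "mytable", "rowId", and "myfam" are placeholders.
    BatchDeleter deleter = conn.createBatchDeleter("mytable", new Authorizations(),
        4 /* query threads */, 10000000L /* max memory */,
        60000L /* max latency ms */, 4 /* write threads */);
    try {
      // Restrict the deletion to a single row and column family.
      deleter.setRanges(Collections.singleton(new Range("rowId")));
      deleter.fetchColumnFamily(new Text("myfam"));
      // Scans the matching entries and writes a delete for each one.
      deleter.delete();
    } finally {
      deleter.close();
    }
  }
}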

If you use the filter option, you would probably want to put the filter
at all scopes, flush, compact, and then remove the filter.  Having the
filter at the scan scope prevents users from seeing any of the data
immediately.  If the filter is only at the majc scope, then users will
see the data in some part of the table while the compaction is
running.  Having the filter at the minc scope will filter out any data
in memory when you flush.  Having the filter at the majc scope will
filter existing data on disk when you compact.
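A rough sketch of that sequence, assuming a hypothetical Filter subclass
com.example.DropUnwantedFilter and a table named "mytable":

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.admin.TableOperations;
import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;

import java.util.EnumSet;

public class FilterAndCompact {
  public static void purge(Connector conn) throws Exception {
    TableOperations ops = conn.tableOperations();
    // com.example.DropUnwantedFilter is a placeholder for your own Filter
    // subclass that suppresses the entries you want gone.
    IteratorSetting setting = new IteratorSetting(30, "purgefilter",
        "com.example.DropUnwantedFilter");

    // Attach at scan, minc, and majc so the data disappears from queries
    // immediately and gets filtered from memory and disk.
    ops.attachIterator("mytable", setting, EnumSet.allOf(IteratorScope.class));

    // Flush in-memory data (minc filter applies), then compact files on disk
    // (majc filter applies).  Passing true waits for each to finish.
    ops.flush("mytable", null, null, true);
    ops.compact("mytable", null, null, true, true);

    // Once the data is gone, the filter is no longer needed.
    ops.removeIterator("mytable", "purgefilter", EnumSet.allOf(IteratorScope.class));
  }
}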

>
> Billie
