accumulo-user mailing list archives

From "Terry P." <>
Subject Re: How to remove entire row at the server side?
Date Wed, 06 Nov 2013 20:49:13 GMT
Eyes of an eagle Billie!  com is correct, but after viewing
"org.apache.accumulo" so many times, my brain was stuck on org and I goofed
in my setiter syntax.

With THAT corrected, here is the new error:

root@meta> setiter -class
com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p
20 -scan -t itertest
2013-11-06 14:46:28,280 [shell.Shell] ERROR: Command could
not be initialized (Unable to load
com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
org.apache.accumulo.core.iterators.OptionDescriber; configure with 'config'

On Wed, Nov 6, 2013 at 2:43 PM, Billie Rinaldi <> wrote:

> Is there a typo in the package name?  One place says "com" and the other
> "org".
> On Wed, Nov 6, 2013 at 12:37 PM, Terry P. <> wrote:
>> Hi William, many thanks for the explanation of scan time versus
>> compaction time. I'll look through the classes again and note where the
>> remove versus suppress wordings are used and open a ticket.
>> As mentioned, I only dabble in Java, but regardless of that fact at this
>> point I'm the one that has to get this done. I've cobbled together my first
>> attempt, but I get the following error when I try to add it as a scan
>> iterator for testing:
>> root@meta> setiter -class
>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p
>> 20 -scan -t itertest
>> 2013-11-06 14:06:34,914 [shell.Shell] ERROR:
>> Command could
>> not be initialized (Servers are unable to load
>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
>> org.apache.accumulo.core.iterators.SortedKeyValueIterator)
>> Here's my source.  Note that the value stored in the expTs ColFam is in
>> the format "yyyyMMddHHmmssS", which I convert to a long for a direct
>> comparison to System.currentTimeMillis(). I only overrode the init and
>> acceptRow methods, hoping the others would work as-is from the base class.
>> One clarification: turns out expTs is the ColumnFamily, and the ingest
>> app does not assign a ColumnQualifier for expTs. So to amend my prior table
>> layout (including the datetime format):
>> Format: Key:CF:CQ:Value
>> abc:data:title:"My fantastic data"
>> abc:data:content:<bytedata>
>> abc:creTs::20130804171412445
>> abc:*expTs*::20131104171412445
>> ... 6-8 more columns of data per row ...
>> where *expTs* is the ColumnFamily to determine if the entire row should
>> be removed based on whether its value is <= NOW.  If a row has not yet been
>> assigned an expiration date, expTs will not be set and the ColumnFamily
>> will not yet be present.  Seems like an odd choice to use distinct Column
>> Families, without Column Qualifiers, but that's how the ingest app was done.
>> I greatly appreciate any advice you can provide.
>> package com.esa.accumulo.iterators;
>> import java.io.IOException;
>> import java.text.ParseException;
>> import java.text.SimpleDateFormat;
>> import java.util.Date;
>> import java.util.Map;
>> import org.apache.accumulo.core.data.Key;
>> import org.apache.accumulo.core.data.Value;
>> import org.apache.accumulo.core.iterators.IteratorEnvironment;
>> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
>> import org.apache.accumulo.core.iterators.user.RowFilter;
>> /**
>>  * A filter that removes rows based on the column designated as the
>> "expiration timestamp" column family.
>>  *
>>  * It removes the row if the value in the expirationTimestamp column is
>> less than currentTime.
>>  *
>>  * TODO: The designation of the expirationTimestamp ColumnFamily and its
>> DateFormat is
>>  * set in the iterator options when the iterator is applied to the table.
>> (For
>>  * now it is hardcoded to match the format used in the Solr-Accumulo
>> plugin)
>>  */
>> public class ExpirationTimestampPurgeFilter extends RowFilter {
>>   private long currentTime;
>>   // TODO: make accumuloDateFormat settable via Iterator Options
>>   // Date Format for Expiration Timestamp ColumnFamily stored in Accumulo
>>   private String expTsDateFormat = "yyyyMMddHHmmssS";
>>   SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);
>>   // TODO: make expTs settable via Iterator Options
>>   // ColumnFamily containing Expiration Timestamp value (note ingest app
>>   // did NOT assign a ColumnQualifier, only a ColumnFamily)
>>   private String expTsColFam = "expTs";
>>   @Override
>>   public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
>>     throws IOException {
>>     if
>> (rowIterator.getTopKey().getColumnFamily().toString().equals(expTsColFam)) {
>>        Date expTsDate = null;
>>        try {
>>          expTsDate = df.parse(rowIterator.getTopValue().toString());
>>            if (expTsDate.getTime() < currentTime)
>>              return false;
>>        } catch (ParseException e) {
>>          // TODO Auto-generated catch block
>>          e.printStackTrace();
>>        }
>>     }
>>     return true;
>>   }
>>   @Override
>>   public void init(SortedKeyValueIterator<Key, Value> source,
>>       Map<String, String> options, IteratorEnvironment env) throws
>> IOException {
>>     super.init(source, options, env);
>>     currentTime = System.currentTimeMillis();
>>   }
>> }
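The conversion of the expTs value to a long, as described in the message above, can be tried in isolation without any Accumulo classes. The following is only a minimal sketch under stated assumptions: the class name ExpTsCheck and its method names are hypothetical, the pattern spells out three millisecond digits ("SSS") rather than the single "S" used in the posted source, the time zone is pinned to UTC (the thread does not say which zone the ingest app uses), and the comparison is "<=" to match the "<= NOW" wording rather than the "<" in the posted code.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class ExpTsCheck {
    // Assumption: 17-digit values such as 20131104171412445 (yyyyMMddHHmmss + millis)
    private static final String PATTERN = "yyyyMMddHHmmssSSS";

    /** Parses an expTs value into epoch milliseconds (UTC is an assumption). */
    public static long parseExpTs(String value) {
        // SimpleDateFormat is not thread-safe, so a new instance is created per call
        SimpleDateFormat df = new SimpleDateFormat(PATTERN);
        df.setLenient(false);
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return df.parse(value).getTime();
        } catch (ParseException e) {
            // Fail loudly instead of printing a stack trace and accepting the row
            throw new IllegalArgumentException("Bad expTs value: " + value, e);
        }
    }

    /** Returns true when the expiration time is at or before nowMillis. */
    public static boolean isExpired(String value, long nowMillis) {
        return parseExpTs(value) <= nowMillis;
    }

    public static void main(String[] args) {
        String expTs = "20131104171412445"; // sample value from the table layout above
        System.out.println(isExpired(expTs, System.currentTimeMillis()));
    }
}
```

Whether a malformed value should reject the row, keep it, or raise an error is a design choice the thread leaves open; the sketch raises an error so that bad data is noticed early.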
>> On Tue, Nov 5, 2013 at 8:48 PM, William Slacum <> wrote:
>>> If an iterator is only set at scan time, then its logic will only be
>>> applied when a client scans the table. The data will persist through major
>>> and minor compaction and be visible if you scanned the RFile(s) backing the
>>> table. "Suppress" is the better word in this case. Would you please open a
>>> ticket pointing us where to update the documentation?
>>> It looks like you'd want to implement a RowFilter for your use case. It
>>> has the necessary hooks to avoid reading a whole row into memory and
>>> handling the logic of determining whether or not to write keys that occur
>>> before the column you're filtering on (at the cost of reading those keys
>>> twice).
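The row-level decision described above can be sketched in plain Java as a loop over every entry of one row, instead of a check on the first entry only. Entries within a row are sorted by column family, so in the example layout from this thread the first entry would be creTs, not expTs. This is a hedged stand-in and not the Accumulo API: the class RowScanSketch, its row helper, and the (column family, value) pairs are hypothetical, and the fixed-width timestamp strings are compared as strings to keep the sketch short. In an actual RowFilter.acceptRow the loop would presumably use rowIterator.hasTop(), getTopKey(), getTopValue(), and next().

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class RowScanSketch {
    /** Hypothetical helper: builds one row from alternating columnFamily, value strings. */
    public static List<Map.Entry<String, String>> row(String... cfValuePairs) {
        List<Map.Entry<String, String>> entries = new ArrayList<>();
        for (int i = 0; i + 1 < cfValuePairs.length; i += 2) {
            entries.add(new SimpleEntry<>(cfValuePairs[i], cfValuePairs[i + 1]));
        }
        return entries;
    }

    /**
     * Returns false (drop the whole row) when the expTs entry is at or before nowStamp.
     * Both stamps are assumed to be 17-digit yyyyMMddHHmmssSSS strings, so string
     * order matches chronological order.
     */
    public static boolean acceptRow(Iterator<Map.Entry<String, String>> row,
                                    String expTsColFam, String nowStamp) {
        while (row.hasNext()) {
            Map.Entry<String, String> entry = row.next();
            if (expTsColFam.equals(entry.getKey())) {
                return entry.getValue().compareTo(nowStamp) > 0;
            }
        }
        return true; // no expTs column yet: the row has no expiration date
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> abc = row(
            "creTs", "20130804171412445",
            "data", "My fantastic data",
            "expTs", "20131104171412445");
        System.out.println(acceptRow(abc.iterator(), "expTs", "20131106000000000"));
    }
}
```

Rows without an expTs entry are kept, matching the statement elsewhere in the thread that the column is absent until an expiration date is assigned.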
>>> On Tue, Nov 5, 2013 at 6:20 PM, Terry P. <> wrote:
>>>> Greetings everyone,
>>>> I'm looking at the AgeOffFilter as a base from which to write a
>>>> server-side filter / iterator to purge rows when they have aged off based
>>>> on the value of a specific column in the row (expiry datetime <= now).
>>>> This differs from the AgeOffFilter in that the criterion for removal is
>>>> from the same column in every row (not the Accumulo timestamp for an
>>>> individual entry), and we need to remove the entire row not just individual
>>>> entries. For example:
>>>> Format: Key:CF:CQ:Value
>>>> abc:data:title:"My fantastic data"
>>>> abc:data:content:<bytedata>
>>>> abc:data:creTs:2013-08-04T17:14:12Z
>>>> abc:data:*expTs*:2013-11-04T17:14:12Z
>>>> ... 6-8 more columns of data per row ...
>>>> where *expTs* is the column to determine if the entire row should be
>>>> removed based on whether its value is <= NOW.
>>>> This task seemed easy enough as a client program (and it is really),
>>>> but a server-side iterator would be far more efficient than sending
>>>> millions of rowkeys across the network just to delete them (we'll be
>>>> deleting more than a million every hour).  But I'm struggling to get there.
>>>> In looking at, is the "magic" in the AgeOffFilter
>>>> class that removes (deletes) an entry from a table the fact that the accept
>>>> method returns false, combined with the fact that the iterator would be set
>>>> to run at -majc or -minc time and it is the compaction code that actually
>>>> deletes the entry?  If set to run only at scan time, would AgeOffFilter
>>>> simply not return the rows during the scan, but not delete them?  The
>>>> wording in the iterator classes varies, some saying "remove" and others
>>>> "suppress", so it's not clear to me.
>>>> If that's the case, then I think I know where to implement the logic.
>>>> The question is, how can I remove all the entries for the row once the
>>>> accept method has determined it meets the criteria?
>>>> Or as Mike Drob mentioned in a prior post, will basing my class on the
>>>> RowFilter class instead of just Filter make things easier?  Or the
>>>> WholeRowIterator?  Just trying to find the simplest solution.
>>>> Sorry for what may be obvious questions but I'm more of a DB Architect
>>>> that does some coding, and not a Java programmer by trade. With all of the
>>>> amazing things Accumulo does, honestly I was surprised when I couldn't find
>>>> a way to delete rows in the shell by criteria other than the rowkey!  I'm
>>>> more used to having a shell to 'delete from *table *where *column *<=
>>>> *value*'.
>>>> But looking at it now, everyone's criteria for deletion will likely be
>>>> different given the flexibility of a key=>value store.  If our rowkey had
>>>> the date/timestamp as a prefix, I know an easy deletemany command in the
>>>> shell would do the trick -- but the nature of the data is such that
>>>> initially no expiration timestamp is set, and there is no means to update
>>>> the key from the client app when expiration timestamp finally gets set (too
>>>> much rework on that common tool I'm afraid).
>>>> Thanks in advance.
