accumulo-user mailing list archives

From "Terry P." <>
Subject Re: How to remove entire row at the server side?
Date Thu, 07 Nov 2013 02:31:04 GMT
Thanks David, good to know.  After adding the implements OptionDescriber
the setiter command worked and it shows up right at the top.

On Wed, Nov 6, 2013 at 8:06 PM, David Medinets wrote:

> Just in case you didn't know, there is a 'classpath' command in the
> Accumulo shell which should list your custom jar. It's handy to verify that
> it was loaded. I think there might also be a log entry, if you have access
> to the logs. I've also found it useful to run 'jar tf <filename>' on the
> Accumulo nodes to verify the jar file contents. Sometimes I've deployed the
> wrong version of a jar file.
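
The 'jar tf' check described above can also be scripted. The following is a minimal sketch using only the standard java.util.jar API; the class name JarClassCheck and its two helper methods are hypothetical names chosen for illustration, not part of Accumulo.

```java
import java.io.IOException;
import java.util.jar.JarFile;

public class JarClassCheck {
    // Converts a fully qualified class name to the entry path used inside a jar.
    static String entryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns true if the jar at jarPath contains the compiled class.
    static boolean jarContains(String jarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getJarEntry(entryName(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: JarClassCheck <jar path> <fully qualified class name>");
            return;
        }
        System.out.println(jarContains(args[0], args[1]) ? "found" : "missing");
    }
}
```

Running it against the deployed jar on each node with com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as the second argument would show whether that node has a jar containing the class. It does not show which version of the class is inside.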
> On Wed, Nov 6, 2013 at 7:56 PM, Billie Rinaldi wrote:
>> Making your class "extends RowFilter implements OptionDescriber" should
>> be fine.  One reason it might have been complaining about the @Override
>> annotations is if the Java compiler is set to 1.5 compatibility rather than
>> 1.6.
>> Regarding getting the same error, did you replace all the jars containing
>> your iterator on all the nodes?  If you did, perhaps it's not reloading the
>> jars properly.  You could restart accumulo to make sure it's using the
>> fresh jar, or you could try renaming your class and dropping it in with a
>> different jar name to ensure the new code is being picked up.
>> On Wed, Nov 6, 2013 at 2:50 PM, Terry P. wrote:
>>> Hi Billie,
>>> Many thanks for your help.  I added those two methods, but had to remove
>>> the @Override as the RowFilter class I'm extending from doesn't implement
>>> them.  Even with these methods in place, I still get the same error trying
>>> to add the iterator in the shell.
>>> I notice that the RowFilter class extends WrappingIterator, which also
>>> doesn't appear to have the describeOptions and validateOptions methods ...
>>> should I try extending from just the Filter class?  I didn't understand the
>>> benefits William listed of extending from the RowFilter class.  I just know
>>> that once I identify a RowKey should be purged based on its expTs ColFam
>>> Value, I want to remove all entries for that RowKey.
>>> On Wed, Nov 6, 2013 at 3:29 PM, Billie Rinaldi wrote:
>>>> To use setiter in the shell, your iterator must implement
>>>> OptionDescriber.  It has two methods, and something like the following
>>>> should work for your iterator.  If you implement passing options to the
>>>> iterator, you'll want to change the null parameters to the constructor of
>>>> IteratorOptions below, and probably also to do some validation in
>>>> validateOptions.
>>>>   @Override
>>>>   public IteratorOptions describeOptions() {
>>>>     return new IteratorOptions("expTs",
>>>>         "Removes rows based on the column designated as the expiration timestamp column family",
>>>>         null, null);
>>>>   }
>>>>   @Override
>>>>   public boolean validateOptions(Map<String,String> options) {
>>>>     return true;
>>>>   }
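
If the iterator later accepts options, the validation mentioned above could look roughly like the sketch below. It is written against plain java.util and java.text so it runs without Accumulo; the option names "expTsColFam" and "expTsDateFormat" and the class ExpTsOptionCheck are assumptions made for illustration, since the thread does not define any options. A real validateOptions method could delegate to a check like this.

```java
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;

public class ExpTsOptionCheck {
    // Hypothetical option names; the thread does not define any.
    static final String COL_FAM_OPT = "expTsColFam";
    static final String DATE_FORMAT_OPT = "expTsDateFormat";

    // Follows the validateOptions contract: true if the options are usable.
    static boolean validate(Map<String, String> options) {
        String colFam = options.get(COL_FAM_OPT);
        if (colFam != null && colFam.isEmpty()) {
            return false; // an explicitly empty column family is rejected
        }
        String pattern = options.get(DATE_FORMAT_OPT);
        if (pattern != null) {
            try {
                new SimpleDateFormat(pattern); // throws on an illegal pattern
            } catch (IllegalArgumentException e) {
                return false;
            }
        }
        return true; // absent options fall back to the hardcoded defaults
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put(DATE_FORMAT_OPT, "yyyyMMddHHmmssS");
        System.out.println(validate(options));
    }
}
```

Rejecting a bad date pattern at setiter time is cheaper than discovering it later as parse failures on the tablet servers.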
>>>> On Wed, Nov 6, 2013 at 12:49 PM, Terry P. wrote:
>>>>> Eyes of an eagle Billie!  com is correct, but after viewing
>>>>> "org.apache.accumulo" so many times, my brain was stuck on org and I
>>>>> used it in my setiter syntax.
>>>>> With THAT corrected, here is the new error:
>>>>> root@meta> setiter -class com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p 20 -scan -t itertest
>>>>> 2013-11-06 14:46:28,280 [shell.Shell] ERROR: Command could not be initialized (Unable to load com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type org.apache.accumulo.core.iterators.OptionDescriber; configure with 'config' instead)
>>>>> On Wed, Nov 6, 2013 at 2:43 PM, Billie Rinaldi wrote:
>>>>>> Is there a typo in the package name?  One place says "com" and the
>>>>>> other "org".
>>>>>> On Wed, Nov 6, 2013 at 12:37 PM, Terry P. wrote:
>>>>>>> Hi William, many thanks for the explanation of scan time versus
>>>>>>> compaction time. I'll look through the classes again, note where the
>>>>>>> "remove" versus "suppress" wordings are used, and open a ticket.
>>>>>>> As mentioned, I only dabble in Java, but regardless of that fact, at
>>>>>>> this point I'm the one who has to get this done. I've cobbled together my
>>>>>>> first attempt, but I get the following error when I try to add it as a
>>>>>>> scan iterator for testing:
>>>>>>> root@meta> setiter -class org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p 20 -scan -t itertest
>>>>>>> 2013-11-06 14:06:34,914 [shell.Shell] ERROR: Command could not be initialized (Servers are unable to load org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as org.apache.accumulo.core.iterators.SortedKeyValueIterator)
>>>>>>> Here's my source.  Note that the value stored in the expTs ColFam is
>>>>>>> in the format "yyyyMMddHHmmssS", which I convert to a long for a direct
>>>>>>> comparison to System.currentTimeMillis(). I only overrode the init and
>>>>>>> acceptRow methods, hoping the others would work as-is from the base class.
>>>>>>> One clarification: it turns out expTs is the ColumnFamily, and the
>>>>>>> ingest app does not assign a ColumnQualifier for expTs. So to amend my
>>>>>>> prior table layout (including the datetime format):
>>>>>>> Format: Key:CF:CQ:Value
>>>>>>> abc:data:title:"My fantastic data"
>>>>>>> abc:data:content:<bytedata>
>>>>>>> abc:creTs::20130804171412445
>>>>>>> abc:*expTs*::20131104171412445
>>>>>>> ... 6-8 more columns of data per row ...
>>>>>>> where *expTs* is the ColumnFamily to determine if the entire row
>>>>>>> should be removed based on whether its value is <= NOW.  If a row has not
>>>>>>> yet been assigned an expiration date, expTs will not be set and the
>>>>>>> ColumnFamily will not yet be present.  It seems like an odd choice to use
>>>>>>> distinct Column Families, without Column Qualifiers, but that's how the
>>>>>>> ingest app was done.
>>>>>>> I greatly appreciate any advice you can provide.
>>>>>>> package com.esa.accumulo.iterators;
>>>>>>>
>>>>>>> import java.io.IOException;
>>>>>>> import java.text.ParseException;
>>>>>>> import java.text.SimpleDateFormat;
>>>>>>> import java.util.Date;
>>>>>>> import java.util.Map;
>>>>>>>
>>>>>>> import org.apache.accumulo.core.data.Key;
>>>>>>> import org.apache.accumulo.core.data.Value;
>>>>>>> import org.apache.accumulo.core.iterators.IteratorEnvironment;
>>>>>>> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
>>>>>>> import org.apache.accumulo.core.iterators.user.RowFilter;
>>>>>>>
>>>>>>> /**
>>>>>>>  * A filter that removes rows based on the column designated as the
>>>>>>>  * "expiration timestamp" column family.
>>>>>>>  *
>>>>>>>  * It removes the row if the value in the expirationTimestamp column
>>>>>>>  * is less than currentTime.
>>>>>>>  *
>>>>>>>  * TODO: The designation of the expirationTimestamp ColumnFamily and
>>>>>>>  * its DateFormat is set in the iterator options when the iterator is
>>>>>>>  * applied to the table. (For now it is hardcoded to match the format
>>>>>>>  * used in the Solr-Accumulo plugin)
>>>>>>>  */
>>>>>>> public class ExpirationTimestampPurgeFilter extends RowFilter {
>>>>>>>   private long currentTime;
>>>>>>>   // TODO: make accumuloDateFormat settable via Iterator Options
>>>>>>>   // Date Format for Expiration Timestamp ColumnFamily stored in Accumulo
>>>>>>>   private String expTsDateFormat = "yyyyMMddHHmmssS";
>>>>>>>   SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);
>>>>>>>   // TODO: make expTs settable via Iterator Options
>>>>>>>   // ColumnFamily containing Expiration Timestamp value (note the app
>>>>>>>   // did NOT assign a ColumnQualifier, only a ColumnFamily)
>>>>>>>   private String expTsColFam = "expTs";
>>>>>>>
>>>>>>>   @Override
>>>>>>>   public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
>>>>>>>       throws IOException {
>>>>>>>     // Walk every entry in the row; the expTs column is not necessarily
>>>>>>>     // the first one (creTs and data sort before it).
>>>>>>>     while (rowIterator.hasTop()) {
>>>>>>>       if (rowIterator.getTopKey().getColumnFamily().toString().equals(expTsColFam)) {
>>>>>>>         try {
>>>>>>>           Date expTsDate = df.parse(rowIterator.getTopValue().toString());
>>>>>>>           if (expTsDate.getTime() < currentTime)
>>>>>>>             return false; // expired: reject the whole row
>>>>>>>         } catch (ParseException e) {
>>>>>>>           // TODO Auto-generated catch block
>>>>>>>           e.printStackTrace();
>>>>>>>         }
>>>>>>>       }
>>>>>>>       rowIterator.next();
>>>>>>>     }
>>>>>>>     return true; // no expTs entry, or not yet expired: keep the row
>>>>>>>   }
>>>>>>>
>>>>>>>   @Override
>>>>>>>   public void init(SortedKeyValueIterator<Key, Value> source,
>>>>>>>       Map<String, String> options, IteratorEnvironment env) throws IOException {
>>>>>>>     super.init(source, options, env);
>>>>>>>     // Captured once, when the iterator is initialized
>>>>>>>     currentTime = System.currentTimeMillis();
>>>>>>>   }
>>>>>>> }
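
The expiry comparison in the source above can be exercised on its own, without a running Accumulo instance. The following is a minimal sketch using only the JDK; the class name ExpiryCheck and the choice of UTC as the time zone are assumptions, since the thread does not say which zone the ingest app writes its timestamps in.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class ExpiryCheck {
    // Format quoted in the thread for the expTs value
    static final String FORMAT = "yyyyMMddHHmmssS";

    // Returns true if value parses and is strictly earlier than nowMillis.
    static boolean isExpired(String value, long nowMillis) {
        // A new formatter per call, because SimpleDateFormat is not thread-safe
        SimpleDateFormat df = new SimpleDateFormat(FORMAT);
        df.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: values are UTC
        try {
            return df.parse(value).getTime() < nowMillis;
        } catch (ParseException e) {
            return false; // an unparseable value never expires the row
        }
    }

    public static void main(String[] args) {
        // Sample value taken from the table layout in the thread
        System.out.println(isExpired("20131104171412445", System.currentTimeMillis()));
    }
}
```

Without an explicit time zone, SimpleDateFormat uses the JVM default, so tablet servers configured with different zones could disagree about when a row expires.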
>>>>>>> On Tue, Nov 5, 2013 at 8:48 PM, William Slacum wrote:
>>>>>>>> If an iterator is only set at scan time, then its logic will only
>>>>>>>> be applied when a client scans the table. The data will persist through
>>>>>>>> major and minor compactions and be visible if you scanned the RFile(s)
>>>>>>>> backing the table. "Suppress" is the better word in this case. Would you
>>>>>>>> please open a ticket pointing us to where to update the documentation?
>>>>>>>> It looks like you'd want to implement a RowFilter for your use
>>>>>>>> case. It has the necessary hooks to avoid reading a whole row into memory
>>>>>>>> and to handle the logic of determining whether or not to write keys that
>>>>>>>> occur before the column you're filtering on (at the cost of reading those
>>>>>>>> keys twice).
>>>>>>>> On Tue, Nov 5, 2013 at 6:20 PM, Terry P. wrote:
>>>>>>>>> Greetings everyone,
>>>>>>>>> I'm looking at the AgeOffFilter as a base from which to write a
>>>>>>>>> server-side filter / iterator to purge rows when they have aged off based
>>>>>>>>> on the value of a specific column in the row (expiry datetime <= now). So
>>>>>>>>> this differs from the AgeOffFilter in that the criterion for removal is
>>>>>>>>> from the same column in every row (not the Accumulo timestamp for an
>>>>>>>>> individual entry), and we need to remove the entire row, not just individual
>>>>>>>>> entries. For example:
>>>>>>>>> Format: Key:CF:CQ:Value
>>>>>>>>> abc:data:title:"My fantastic data"
>>>>>>>>> abc:data:content:<bytedata>
>>>>>>>>> abc:data:creTs:2013-08-04T17:14:12Z
>>>>>>>>> abc:data:*expTs*:2013-11-04T17:14:12Z
>>>>>>>>> ... 6-8 more columns of data per row ...
>>>>>>>>> where *expTs* is the column to determine if the entire row should
>>>>>>>>> be removed based on whether its value is <= NOW.
>>>>>>>>> This task seemed easy enough as a client program (and it is
>>>>>>>>> really), but a server-side iterator would be far more efficient than
>>>>>>>>> sending millions of rowkeys across the network just to delete them (we'll
>>>>>>>>> be deleting more than a million every hour).  But I'm struggling to get
>>>>>>>>> there.
>>>>>>>>> In looking at the code, is the "magic" in the
>>>>>>>>> AgeOffFilter class that removes (deletes) an entry from a table the fact
>>>>>>>>> that the accept method returns false, combined with the fact that the
>>>>>>>>> iterator would be set to run at -majc or -minc time and it is the
>>>>>>>>> compaction code that actually deletes the entry?  If set to run only at
>>>>>>>>> scan time, would AgeOffFilter simply not return the rows during the scan,
>>>>>>>>> but not delete them?  The wording in the iterator classes varies, some
>>>>>>>>> saying "remove" and others "suppress", so it's not clear to me.
>>>>>>>>> If that's the case, then I think I know where to implement the
>>>>>>>>> logic. The question is, how can I remove all the entries for the row once
>>>>>>>>> the accept method has determined it meets the criteria?
>>>>>>>>> Or as Mike Drob mentioned in a prior post, will basing my class on
>>>>>>>>> the RowFilter class instead of just Filter make things easier?  Or the
>>>>>>>>> WholeRowIterator?  Just trying to find the simplest solution.
>>>>>>>>> Sorry for what may be obvious questions but I'm more of a DB
>>>>>>>>> Architect that does some coding, and not a Java programmer by trade. With
>>>>>>>>> all of the amazing things Accumulo does, honestly I was surprised when I
>>>>>>>>> couldn't find a way to delete rows in the shell by criteria other than the
>>>>>>>>> rowkey!  I'm more used to having a shell to 'delete from *table* where
>>>>>>>>> *column* <= *value*'.
>>>>>>>>> But looking at it now, everyone's criteria for deletion will
>>>>>>>>> likely be different given the flexibility of a key=>value store.  If our
>>>>>>>>> rowkey had the date/timestamp as a prefix, I know an easy deletemany
>>>>>>>>> command in the shell would do the trick -- but the nature of the data is
>>>>>>>>> such that initially no expiration timestamp is set, and there is no means
>>>>>>>>> to update the key from the client app when the expiration timestamp finally
>>>>>>>>> gets set (too much rework on that common tool I'm afraid).
>>>>>>>>> Thanks in advance.
