accumulo-user mailing list archives

From Billie Rinaldi <billie.rina...@gmail.com>
Subject Re: How to remove entire row at the server side?
Date Thu, 07 Nov 2013 00:56:07 GMT
Making your class "extends RowFilter implements OptionDescriber" should be
fine.  One reason it might have been complaining about the @Override
annotations is if the Java compiler is set to 1.5 compatibility rather than
1.6.
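
A minimal sketch of the @Override point, using only the JDK. Describer,
BaseFilter and PurgeFilter are hypothetical stand-ins invented for this
illustration; they are not Accumulo's OptionDescriber or RowFilter. The
method is declared by the interface rather than the superclass, which is the
case that a compiler at 1.5 compliance may reject.

```java
import java.util.Map;

// Hypothetical stand-in for an interface such as OptionDescriber (not Accumulo's type).
interface Describer {
  boolean validateOptions(Map<String, String> options);
}

// Hypothetical stand-in for a base class that does not declare validateOptions.
class BaseFilter {
}

// The method below implements an interface method. At 1.6 and later the
// @Override annotation is accepted here; a compiler at 1.5 compliance may
// report that the method does not override a superclass method.
class PurgeFilter extends BaseFilter implements Describer {
  @Override
  public boolean validateOptions(Map<String, String> options) {
    return true; // accept everything until real options are defined
  }
}

final class OverrideDemo {
  public static void main(String[] args) {
    Describer d = new PurgeFilter();
    System.out.println(d.validateOptions(null));
  }
}
```

If the annotation is rejected, raising the compiler's source level is likely
a cleaner fix than deleting the annotation.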

Regarding getting the same error, did you replace all the jars containing
your iterator on all the nodes?  If you did, perhaps it's not reloading the
jars properly.  You could restart accumulo to make sure it's using the
fresh jar, or you could try renaming your class and dropping it in with a
different jar name to ensure the new code is being picked up.
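
One way to see which jar a class actually came from is to ask the JVM for
the class's code source. The sketch below uses only standard JDK calls;
JarOrigin and describe are names made up for this example. It reports on the
JVM it runs in, so it is only informative when run with the same classpath
as the process in question, and it does not inspect the tablet servers.

```java
import java.security.CodeSource;

final class JarOrigin {
  // Returns the class name followed by the location it was loaded from,
  // or a note when the JVM does not record a location.
  static String describe(Class<?> type) {
    CodeSource source = type.getProtectionDomain().getCodeSource();
    if (source == null || source.getLocation() == null) {
      return type.getName() + " <- (no code source; likely a JDK class)";
    }
    return type.getName() + " <- " + source.getLocation();
  }

  public static void main(String[] args) throws ClassNotFoundException {
    // Pass a fully qualified class name as the first argument;
    // with no argument the tool describes itself.
    String name = args.length > 0 ? args[0] : JarOrigin.class.getName();
    System.out.println(describe(Class.forName(name)));
  }
}
```

Running it with the iterator's class name as the argument would print the
jar path if the class is found, or fail with ClassNotFoundException if it is
not on that classpath.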

On Wed, Nov 6, 2013 at 2:50 PM, Terry P. <texpilot@gmail.com> wrote:

> Hi Billie,
> Many thanks for your help.  I added those two methods, but had to remove
> the @Override as the RowFilter class I'm extending from doesn't implement
> them.  Even with these methods in place, I still get the same error trying
> to add the iterator in the shell.
>
> I notice that the RowFilter class extends WrappingIterator, which also
> doesn't appear to have the describeOptions and validateOptions methods ...
> should I try extending from just the Filter class?  I didn't understand the
> benefits William listed of extending from the RowFilter class.  I just know
> that once I identify a RowKey should be purged based on its expTs ColFam
> Value, I want to remove all entries for that RowKey.
>
>
> On Wed, Nov 6, 2013 at 3:29 PM, Billie Rinaldi <billie.rinaldi@gmail.com> wrote:
>
>> To use setiter in the shell, your iterator must implement
>> OptionDescriber.  It has two methods, and something like the following
>> should work for your iterator.  If you implement passing options to the
>> iterator, you'll want to change the null parameters to the constructor of
>> IteratorOptions below, and probably also to do some validation in
>> validateOptions.
>>
>>   @Override
>>   public IteratorOptions describeOptions() {
>>     return new IteratorOptions("expTs",
>>         "Removes rows based on the column designated as the expiration "
>>             + "timestamp column family",
>>         null, null);
>>   }
>>
>>   @Override
>>   public boolean validateOptions(Map<String,String> options) {
>>     return true;
>>   }
>>
>>
>>
>> On Wed, Nov 6, 2013 at 12:49 PM, Terry P. <texpilot@gmail.com> wrote:
>>
>>> Eyes of an eagle Billie!  com is correct, but after viewing
>>> "org.apache.accumulo" so many times, my brain was stuck on org and I goofed
>>> in my setiter syntax.
>>>
>>> With THAT corrected, here is the new error:
>>>
>>> root@meta> setiter -class
>>> com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p
>>> 20 -scan -t itertest
>>> 2013-11-06 14:46:28,280 [shell.Shell] ERROR:
>>> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
>>> not be initialized (Unable to load
>>> com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
>>> org.apache.accumulo.core.iterators.OptionDescriber; configure with 'config'
>>> instead)
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 6, 2013 at 2:43 PM, Billie Rinaldi
>>> <billie.rinaldi@gmail.com> wrote:
>>>
>>>> Is there a typo in the package name?  One place says "com" and the
>>>> other "org".
>>>>
>>>>
>>>> On Wed, Nov 6, 2013 at 12:37 PM, Terry P. <texpilot@gmail.com> wrote:
>>>>
>>>>> Hi William, many thanks for the explanation of scan time versus
>>>>> compaction time. I'll look through the classes again and note where the
>>>>> remove versus suppress wordings are used and open a ticket.
>>>>>
>>>>> As mentioned, I only dabble in Java, but regardless of that fact at
>>>>> this point I'm the one that has to get this done. I've cobbled together
>>>>> my first attempt, but I get the following error when I try to add it as
>>>>> a scan iterator for testing:
>>>>>
>>>>> root@meta> setiter -class
>>>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter
>>>>> -p 20 -scan -t itertest
>>>>> 2013-11-06 14:06:34,914 [shell.Shell] ERROR:
>>>>> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
>>>>> not be initialized (Servers are unable to load
>>>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
>>>>> org.apache.accumulo.core.iterators.SortedKeyValueIterator)
>>>>>
>>>>> Here's my source.  Note that the value stored in the expTs ColFam is
>>>>> in the format "yyyyMMddHHmmssS", which I convert to a long for a direct
>>>>> comparison to System.currentTimeMillis(). I only overrode the init and
>>>>> acceptRow methods, hoping the others would work as-is from the base class.
>>>>>
>>>>> One clarification: turns out expTs is the ColumnFamily, and the ingest
>>>>> app does not assign a ColumnQualifier for expTs. So to amend my prior
>>>>> table layout (including the datetime format):
>>>>>
>>>>>
>>>>> Format: Key:CF:CQ:Value
>>>>> abc:data:title:"My fantastic data"
>>>>> abc:data:content:<bytedata>
>>>>> abc:creTs::20130804171412445
>>>>> abc:*expTs*::20131104171412445
>>>>> ... 6-8 more columns of data per row ...
>>>>>
>>>>> where *expTs* is the ColumnFamily to determine if the entire row
>>>>> should be removed based on whether its value is <= NOW.  If a row has
>>>>> not yet been assigned an expiration date, expTs will not be set and the
>>>>> ColumnFamily will not yet be present.  Seems like an odd choice to use
>>>>> distinct Column Families, without Column Qualifiers, but that's how the
>>>>> ingest app was done.
>>>>>
>>>>> I greatly appreciate any advice you can provide.
>>>>>
>>>>> package com.esa.accumulo.iterators;
>>>>>
>>>>> import java.io.IOException;
>>>>> import java.text.ParseException;
>>>>> import java.text.SimpleDateFormat;
>>>>> import java.util.Date;
>>>>> import java.util.Map;
>>>>>
>>>>> import org.apache.accumulo.core.data.Key;
>>>>> import org.apache.accumulo.core.data.Value;
>>>>> import org.apache.accumulo.core.iterators.IteratorEnvironment;
>>>>> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
>>>>> import org.apache.accumulo.core.iterators.user.RowFilter;
>>>>>
>>>>> /**
>>>>>  * A filter that removes rows based on the column designated as the
>>>>>  * "expiration timestamp" column family.
>>>>>  *
>>>>>  * It removes the row if the value in the expirationTimestamp column
>>>>>  * is less than currentTime.
>>>>>  *
>>>>>  * TODO: The designation of the expirationTimestamp ColumnFamily and
>>>>>  * its DateFormat is set in the iterator options when the iterator is
>>>>>  * applied to the table. (For now it is hardcoded to match the format
>>>>>  * used in the Solr-Accumulo plugin)
>>>>>  */
>>>>> public class ExpirationTimestampPurgeFilter extends RowFilter {
>>>>>   private long currentTime;
>>>>>   // TODO: make accumuloDateFormat settable via Iterator Options
>>>>>   // Date Format for Expiration Timestamp ColumnFamily stored in
>>>>>   // Accumulo
>>>>>   private String expTsDateFormat = "yyyyMMddHHmmssS";
>>>>>   SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);
>>>>>
>>>>>   // TODO: make expTs settable via Iterator Options
>>>>>   // ColumnFamily containing Expiration Timestamp value (note ingest
>>>>>   // app did NOT assign a ColumnQualifier, only a ColumnFamily)
>>>>>   private String expTsColFam = "expTs";
>>>>>
>>>>>   @Override
>>>>>   public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
>>>>>     throws IOException {
>>>>>
>>>>>     if (rowIterator.getTopKey().getColumnFamily().toString()
>>>>>         .equals(expTsColFam)) {
>>>>>        Date expTsDate = null;
>>>>>        try {
>>>>>          expTsDate = df.parse(rowIterator.getTopValue().toString());
>>>>>            if (expTsDate.getTime() < currentTime)
>>>>>              return false;
>>>>>        } catch (ParseException e) {
>>>>>          // TODO Auto-generated catch block
>>>>>          e.printStackTrace();
>>>>>        }
>>>>>     }
>>>>>     return true;
>>>>>   }
>>>>>
>>>>>   @Override
>>>>>   public void init(SortedKeyValueIterator<Key, Value> source,
>>>>>       Map<String, String> options, IteratorEnvironment env)
>>>>>       throws IOException {
>>>>>     super.init(source, options, env);
>>>>>     currentTime = System.currentTimeMillis();
>>>>>   }
>>>>>
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 5, 2013 at 8:48 PM, William Slacum
>>>>> <wilhelm.von.cloud@accumulo.net> wrote:
>>>>>
>>>>>> If an iterator is only set at scan time, then its logic will only be
>>>>>> applied when a client scans the table. The data will persist through
>>>>>> major and minor compaction and be visible if you scanned the RFile(s)
>>>>>> backing the table. "Suppress" is the better word in this case. Would
>>>>>> you please open a ticket pointing us where to update the
>>>>>> documentation?
>>>>>>
>>>>>> It looks like you'd want to implement a RowFilter for your use case.
>>>>>> It has the necessary hooks to avoid reading a whole row into memory
>>>>>> and to handle the logic of determining whether or not to write keys
>>>>>> that occur before the column you're filtering on (at the cost of
>>>>>> reading those keys twice).
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 5, 2013 at 6:20 PM, Terry P. <texpilot@gmail.com> wrote:
>>>>>>
>>>>>>> Greetings everyone,
>>>>>>> I'm looking at the AgeOffFilter as a base from which to write a
>>>>>>> server-side filter / iterator to purge rows when they have aged off
>>>>>>> based on the value of a specific column in the row (expiry datetime
>>>>>>> <= now). So this differs from the AgeOffFilter in that the criterion
>>>>>>> for removal is from the same column in every row (not the Accumulo
>>>>>>> timestamp for an individual entry), and we need to remove the entire
>>>>>>> row not just individual entries. For example:
>>>>>>>
>>>>>>> Format: Key:CF:CQ:Value
>>>>>>> abc:data:title:"My fantastic data"
>>>>>>> abc:data:content:<bytedata>
>>>>>>> abc:data:creTs:2013-08-04T17:14:12Z
>>>>>>> abc:data:*expTs*:2013-11-04T17:14:12Z
>>>>>>> ... 6-8 more columns of data per row ...
>>>>>>>
>>>>>>> where *expTs* is the column to determine if the entire row should
>>>>>>> be removed based on whether its value is <= NOW.
>>>>>>>
>>>>>>> This task seemed easy enough as a client program (and it is really),
>>>>>>> but a server-side iterator would be far more efficient than sending
>>>>>>> millions of rowkeys across the network just to delete them (we'll be
>>>>>>> deleting more than a million every hour).  But I'm struggling to get
>>>>>>> there.
>>>>>>>
>>>>>>> In looking at AgeOffFilter.java, is the "magic" in the AgeOffFilter
>>>>>>> class that removes (deletes) an entry from a table the fact that the
>>>>>>> accept method returns false, combined with the fact that the iterator
>>>>>>> would be set to run at -majc or -minc time and it is the compaction
>>>>>>> code that actually deletes the entry?  If set to run only at scan
>>>>>>> time, would AgeOffFilter simply not return the rows during the scan,
>>>>>>> but not delete them?  The wording in the iterator classes varies,
>>>>>>> some saying "remove", others "suppress", so it's not clear to me.
>>>>>>>
>>>>>>> If that's the case, then I think I know where to implement the
>>>>>>> logic. The question is, how can I remove all the entries for the row
>>>>>>> once the accept method has determined it meets the criteria?
>>>>>>>
>>>>>>> Or as Mike Drob mentioned in a prior post, will basing my class on
>>>>>>> the RowFilter class instead of just Filter make things easier?  Or
>>>>>>> the WholeRowIterator?  Just trying to find the simplest solution.
>>>>>>>
>>>>>>> Sorry for what may be obvious questions but I'm more of a DB
>>>>>>> Architect that does some coding, and not a Java programmer by trade.
>>>>>>> With all of the amazing things Accumulo does, honestly I was surprised
>>>>>>> when I couldn't find a way to delete rows in the shell by criteria
>>>>>>> other than the rowkey!  I'm more used to having a shell to 'delete
>>>>>>> from *table* where *column* <= *value*'.
>>>>>>>
>>>>>>> But looking at it now, everyone's criteria for deletion will likely
>>>>>>> be different given the flexibility of a key=>value store.  If our
>>>>>>> rowkey had the date/timestamp as a prefix, I know an easy deletemany
>>>>>>> command in the shell would do the trick -- but the nature of the data
>>>>>>> is such that initially no expiration timestamp is set, and there is
>>>>>>> no means to update the key from the client app when expiration
>>>>>>> timestamp finally gets set (too much rework on that common tool I'm
>>>>>>> afraid).
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
