accumulo-user mailing list archives

From Billie Rinaldi <billie.rina...@gmail.com>
Subject Re: How to remove entire row at the server side?
Date Wed, 06 Nov 2013 21:29:28 GMT
To use setiter in the shell, your iterator must implement OptionDescriber.
It has two methods, and something like the following should work for your
iterator.  If you implement passing options to the iterator, you'll want to
change the null parameters to the constructor of IteratorOptions below, and
probably also do some validation in validateOptions.

  @Override
  public IteratorOptions describeOptions() {
    return new IteratorOptions("expTs", "Removes rows based on the column "
        + "designated as the expiration timestamp column family", null, null);
  }

  @Override
  public boolean validateOptions(Map<String,String> options) {
    return true;
  }
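
If options are added later, the validation could look roughly like the
sketch below. It is only an illustration: the option names "expTsColFam"
and "expTsDateFormat" are made up here, and the class is standalone (it
uses no Accumulo types) so that it runs with just the JDK. In the real
iterator the same checks would sit inside validateOptions, and the named
options would be described in the map passed as the third argument to
IteratorOptions.

```java
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.Map;

// Standalone sketch: no Accumulo classes are used here.
final class ExpTsOptionCheck {
  // Hypothetical option names, chosen for illustration only.
  static final String COLFAM_OPT = "expTsColFam";
  static final String FORMAT_OPT = "expTsDateFormat";

  // Same shape as OptionDescriber.validateOptions(Map<String,String>):
  // returns true when every option that is present looks usable.
  static boolean validateOptions(Map<String,String> options) {
    String colFam = options.get(COLFAM_OPT);
    if (colFam != null && colFam.isEmpty()) {
      return false; // treat an empty name as a configuration mistake
    }
    String format = options.get(FORMAT_OPT);
    if (format != null) {
      try {
        new SimpleDateFormat(format); // throws on an illegal pattern letter
      } catch (IllegalArgumentException e) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Map<String,String> options = new HashMap<>();
    options.put(FORMAT_OPT, "yyyyMMddHHmmssS");
    System.out.println(validateOptions(options));
    options.put(FORMAT_OPT, "bogus");
    System.out.println(validateOptions(options));
  }
}
```

The first call uses the date format from your source and should pass; the
second uses a pattern containing a letter SimpleDateFormat does not
recognize and should be rejected. Options that are absent are accepted so
the hardcoded defaults can still apply.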



On Wed, Nov 6, 2013 at 12:49 PM, Terry P. <texpilot@gmail.com> wrote:

> Eyes of an eagle Billie!  com is correct, but after viewing
> "org.apache.accumulo" so many times, my brain was stuck on org and I goofed
> in my setiter syntax.
>
> With THAT corrected, here is the new error:
>
> root@meta> setiter -class
> com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p
> 20 -scan -t itertest
> 2013-11-06 14:46:28,280 [shell.Shell] ERROR:
> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
> not be initialized (Unable to load
> com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
> org.apache.accumulo.core.iterators.OptionDescriber; configure with 'config'
> instead)
>
>
>
>
>
> On Wed, Nov 6, 2013 at 2:43 PM, Billie Rinaldi <billie.rinaldi@gmail.com> wrote:
>
>> Is there a typo in the package name?  One place says "com" and the other
>> "org".
>>
>>
>> On Wed, Nov 6, 2013 at 12:37 PM, Terry P. <texpilot@gmail.com> wrote:
>>
>>> Hi William, many thanks for the explanation of scan time versus
>>> compaction time. I'll look through the classes again and note where the
>>> remove versus suppress wordings are used and open a ticket.
>>>
>>> As mentioned, I only dabble in java, but regardless of that fact at this
>>> point I'm the one that has to get this done. I've cobbled together my first
>>> attempt, but I get the following error where I try to add it as a scan
>>> iterator for testing:
>>>
>>> root@meta> setiter -class
>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p
>>> 20 -scan -t itertest
>>> 2013-11-06 14:06:34,914 [shell.Shell] ERROR:
>>> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
>>> not be initialized (Servers are unable to load
>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
>>> org.apache.accumulo.core.iterators.SortedKeyValueIterator)
>>>
>>> Here's my source.  Note that the value stored in the expTs ColFam is in
>>> the format "yyyyMMddHHmmssS", which I convert to a long for a direct
>>> comparison to System.currentTimeMillis(). I only overrode the init and
>>> acceptRow methods, hoping the others would work as-is from the base class.
>>>
>>> One clarification: turns out expTs is the ColumnFamily, and the ingest
>>> app does not assign a ColumnQualifier for expTs. So to amend my prior table
>>> layout (including the datetime format):
>>>
>>>
>>> Format: Key:CF:CQ:Value
>>> abc:data:title:"My fantastic data"
>>> abc:data:content:<bytedata>
>>> abc:creTs::20130804171412445
>>> abc:*expTs*::20131104171412445
>>> ... 6-8 more columns of data per row ...
>>>
>>> where *expTs* is the ColumnFamily to determine if the entire row should
>>> be removed based on whether its value is <= NOW.  If a row has not yet been
>>> assigned an expiration date, expTs will not be set and the ColumnFamily
>>> will not yet be present.  Seems like an odd choice to use distinct Column
>>> Families, without Column Qualifiers, but that's how the ingest app was done.
>>>
>>> I greatly appreciate any advice you can provide.
>>>
>>> package com.esa.accumulo.iterators;
>>>
>>> import java.io.IOException;
>>> import java.text.ParseException;
>>> import java.text.SimpleDateFormat;
>>> import java.util.Date;
>>> import java.util.Map;
>>>
>>> import org.apache.accumulo.core.data.Key;
>>> import org.apache.accumulo.core.data.Value;
>>> import org.apache.accumulo.core.iterators.IteratorEnvironment;
>>> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
>>> import org.apache.accumulo.core.iterators.user.RowFilter;
>>>
>>> /**
>>>  * A filter that removes rows based on the column designated as the
>>>  * "expiration timestamp" column family.
>>>  *
>>>  * It removes the row if the value in the expirationTimestamp column is
>>>  * less than currentTime.
>>>  *
>>>  * TODO: The designation of the expirationTimestamp ColumnFamily and its
>>>  * DateFormat is set in the iterator options when the iterator is applied
>>>  * to the table. (For now it is hardcoded to match the format used in the
>>>  * Solr-Accumulo plugin)
>>>  */
>>> public class ExpirationTimestampPurgeFilter extends RowFilter {
>>>   private long currentTime;
>>>   // TODO: make accumuloDateFormat settable via Iterator Options
>>>   // Date Format for Expiration Timestamp ColumnFamily stored in Accumulo
>>>   private String expTsDateFormat = "yyyyMMddHHmmssS";
>>>   SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);
>>>
>>>   // TODO: make expTs settable via Iterator Options
>>>   // ColumnFamily containing Expiration Timestamp value (note ingest app
>>>   // did NOT assign a ColumnQualifier, only a ColumnFamily)
>>>   private String expTsColFam = "expTs";
>>>
>>>   @Override
>>>   public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
>>>       throws IOException {
>>>
>>>     if (rowIterator.getTopKey().getColumnFamily().toString().equals(expTsColFam)) {
>>>       Date expTsDate = null;
>>>       try {
>>>         expTsDate = df.parse(rowIterator.getTopValue().toString());
>>>         if (expTsDate.getTime() < currentTime)
>>>           return false;
>>>       } catch (ParseException e) {
>>>         // TODO Auto-generated catch block
>>>         e.printStackTrace();
>>>       }
>>>     }
>>>     return true;
>>>   }
>>>
>>>   @Override
>>>   public void init(SortedKeyValueIterator<Key, Value> source,
>>>       Map<String, String> options, IteratorEnvironment env) throws IOException {
>>>     super.init(source, options, env);
>>>     currentTime = System.currentTimeMillis();
>>>   }
>>>
>>> }
>>>
>>>
>>>
>>> On Tue, Nov 5, 2013 at 8:48 PM, William Slacum <
>>> wilhelm.von.cloud@accumulo.net> wrote:
>>>
>>>> If an iterator is only set at scan time, then its logic will only be
>>>> applied when a client scans the table. The data will persist through major
>>>> and minor compaction and be visible if you scanned the RFile(s) backing the
>>>> table. "Suppress" is the better word in this case. Would you please open a
>>>> ticket pointing us where to update the documentation?
>>>>
>>>> It looks like you'd want to implement a RowFilter for your use case. It
>>>> has the necessary hooks to avoid reading a whole row into memory and
>>>> handling the logic of determining whether or not to write keys that occur
>>>> before the column you're filtering on (at the cost of reading those keys
>>>> twice).
>>>>
>>>>
>>>> On Tue, Nov 5, 2013 at 6:20 PM, Terry P. <texpilot@gmail.com> wrote:
>>>>
>>>>> Greetings everyone,
>>>>> I'm looking at the AgeOffFilter as a base from which to write a
>>>>> server-side filter / iterator to purge rows when they have aged off based
>>>>> on the value of a specific column in the row (expiry datetime <= now). So
>>>>> this differs from the AgeOffFilter in that the criterion for removal is
>>>>> from the same column in every row (not the Accumulo timestamp for an
>>>>> individual entry), and we need to remove the entire row not just individual
>>>>> entries. For example:
>>>>>
>>>>> Format: Key:CF:CQ:Value
>>>>> abc:data:title:"My fantastic data"
>>>>> abc:data:content:<bytedata>
>>>>> abc:data:creTs:2013-08-04T17:14:12Z
>>>>> abc:data:*expTs*:2013-11-04T17:14:12Z
>>>>> ... 6-8 more columns of data per row ...
>>>>>
>>>>> where *expTs* is the column to determine if the entire row should be
>>>>> removed based on whether its value is <= NOW.
>>>>>
>>>>> This task seemed easy enough as a client program (and it is really),
>>>>> but a server-side iterator would be far more efficient than sending
>>>>> millions of rowkeys across the network just to delete them (we'll be
>>>>> deleting more than a million every hour).  But I'm struggling to get there.
>>>>>
>>>>> In looking at AgeOffFilter.java, is the "magic" in the AgeOffFilter
>>>>> class that removes (deletes) an entry from a table the fact that the accept
>>>>> method returns false, combined with the fact that the iterator would be set
>>>>> to run at -majc or -minc time and it is the compaction code that actually
>>>>> deletes the entry?  If set to run only at scan time, would AgeOffFilter
>>>>> simply not return the rows during the scan, but not delete them?  The
>>>>> wording in the iterator classes varies: some say "remove", others say
>>>>> "suppress", so it's not clear to me.
>>>>>
>>>>> If that's the case, then I think I know where to implement the logic.
>>>>> The question is, how can I remove all the entries for the row once the
>>>>> accept method has determined it meets the criteria?
>>>>>
>>>>> Or as Mike Drob mentioned in a prior post, will basing my class on the
>>>>> RowFilter class instead of just Filter make things easier?  Or the
>>>>> WholeRowIterator?  Just trying to find the simplest solution.
>>>>>
>>>>> Sorry for what may be obvious questions but I'm more of a DB Architect
>>>>> that does some coding, and not a Java programmer by trade. With all of the
>>>>> amazing things Accumulo does, honestly I was surprised when I couldn't find
>>>>> a way to delete rows in the shell by criteria other than the rowkey!  I'm
>>>>> more used to having a shell to 'delete from *table* where *column* <=
>>>>> *value*'.
>>>>>
>>>>> But looking at it now, everyone's criteria for deletion will likely be
>>>>> different given the flexibility of a key=>value store.  If our rowkey had
>>>>> the date/timestamp as a prefix, I know an easy deletemany command in the
>>>>> shell would do the trick -- but the nature of the data is such that
>>>>> initially no expiration timestamp is set, and there is no means to update
>>>>> the key from the client app when expiration timestamp finally gets set (too
>>>>> much rework on that common tool I'm afraid).
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>
>>>>
>>>
>>
>
