accumulo-user mailing list archives

From "Terry P." <texpi...@gmail.com>
Subject Re: How to remove entire row at the server side?
Date Wed, 06 Nov 2013 20:37:28 GMT
Hi William, many thanks for the explanation of scan time versus compaction
time. I'll look through the classes again and note where the remove versus
suppress wordings are used and open a ticket.

As mentioned, I only dabble in Java, but regardless of that fact, at this
point I'm the one who has to get this done. I've cobbled together my first
attempt, but I get the following error when I try to add it as a scan
iterator for testing:

root@meta> setiter -class
org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p
20 -scan -t itertest
2013-11-06 14:06:34,914 [shell.Shell] ERROR:
org.apache.accumulo.core.util.shell.ShellCommandException: Command could
not be initialized (Servers are unable to load
org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
org.apache.accumulo.core.iterators.SortedKeyValueIterator)

Here's my source.  Note that the value stored in the expTs ColFam is in the
format "yyyyMMddHHmmssS", which I convert to a long for a direct comparison
to System.currentTimeMillis(). I only overrode the init and acceptRow
methods, hoping the others would work as-is from the base class.
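
A minimal sketch of that string-to-long conversion, using only the JDK. The
pattern is the one quoted above; the class and method names are hypothetical,
and pinning the parser to UTC is an assumption (the message does not say which
zone the ingest app writes).

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class ExpTsParseSketch {
  // Pattern copied from the message above.
  private static final String EXP_TS_FORMAT = "yyyyMMddHHmmssS";

  /** Converts an expTs string such as "20131104171412445" to epoch millis. */
  public static long toEpochMillis(String expTs) {
    // SimpleDateFormat is not thread-safe, so build one per call.
    SimpleDateFormat df = new SimpleDateFormat(EXP_TS_FORMAT);
    df.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: values are UTC
    try {
      return df.parse(expTs).getTime();
    } catch (ParseException e) {
      throw new IllegalArgumentException("Unparseable expTs: " + expTs, e);
    }
  }

  /** True when the expiration time is at or before the given instant. */
  public static boolean isExpired(String expTs, long nowMillis) {
    return toEpochMillis(expTs) <= nowMillis;
  }

  public static void main(String[] args) {
    String expTs = "20131104171412445";
    System.out.println(toEpochMillis(expTs));
    System.out.println(isExpired(expTs, System.currentTimeMillis()));
  }
}
```

Without an explicit zone, SimpleDateFormat parses in the JVM's default time
zone, so the result would depend on where the code runs.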

One clarification: turns out expTs is the ColumnFamily, and the ingest app
does not assign a ColumnQualifier for expTs. So to amend my prior table
layout (including the datetime format):

Format: Key:CF:CQ:Value
abc:data:title:"My fantastic data"
abc:data:content:<bytedata>
abc:creTs::20130804171412445
abc:*expTs*::20131104171412445
... 6-8 more columns of data per row ...

where *expTs* is the ColumnFamily to determine if the entire row should be
removed based on whether its value is <= NOW.  If a row has not yet been
assigned an expiration date, expTs will not be set and the ColumnFamily
will not yet be present.  Seems like an odd choice to use distinct Column
Families, without Column Qualifiers, but that's how the ingest app was done.

I greatly appreciate any advice you can provide.

package com.esa.accumulo.iterators;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Map;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
import org.apache.accumulo.core.iterators.user.RowFilter;

/**
 * A filter that removes rows based on the column designated as the
 * "expiration timestamp" column family.
 *
 * It removes the row if the value in the expirationTimestamp column is
 * less than currentTime.
 *
 * TODO: The designation of the expirationTimestamp ColumnFamily and its
 * DateFormat is set in the iterator options when the iterator is applied
 * to the table. (For now it is hardcoded to match the format used in the
 * Solr-Accumulo plugin)
 */
public class ExpirationTimestampPurgeFilter extends RowFilter {
  private long currentTime;
  // TODO: make accumuloDateFormat settable via Iterator Options
  // Date Format for Expiration Timestamp ColumnFamily stored in Accumulo
  private String expTsDateFormat = "yyyyMMddHHmmssS";
  SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);

  // TODO: make expTs settable via Iterator Options
  // ColumnFamily containing Expiration Timestamp value (note ingest app
  // did NOT assign a ColumnQualifier, only a ColumnFamily)
  private String expTsColFam = "expTs";

  @Override
  public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
    throws IOException {

    if (rowIterator.getTopKey().getColumnFamily().toString()
        .equals(expTsColFam)) {
      Date expTsDate = null;
      try {
        expTsDate = df.parse(rowIterator.getTopValue().toString());
        if (expTsDate.getTime() < currentTime)
          return false;
      } catch (ParseException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
      }
    }
    return true;
  }

  @Override
  public void init(SortedKeyValueIterator<Key, Value> source,
      Map<String, String> options, IteratorEnvironment env)
      throws IOException {
    super.init(source, options, env);
    currentTime = System.currentTimeMillis();
  }

}
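
One thing the listing above does not do is walk the row: acceptRow inspects
only the first entry the row iterator presents. Entries within a row are sorted
by column family, so with the layout described earlier the first entry would be
creTs (or data), not expTs. The sketch below illustrates the row-level decision
with plain JDK types. Entry is a hypothetical stand-in, not an Accumulo class;
the UTC zone is an assumption; and the comparison uses <= as in the prose,
where the listing uses <.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.List;
import java.util.TimeZone;

public class RowExpirySketch {
  /** Hypothetical stand-in for one entry of a row; not an Accumulo class. */
  public static final class Entry {
    final String family;
    final String value;

    public Entry(String family, String value) {
      this.family = family;
      this.value = value;
    }
  }

  /** Parses the expTs format from the message; the UTC zone is an assumption. */
  public static long parseExpTs(String text) {
    SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHHmmssS");
    df.setTimeZone(TimeZone.getTimeZone("UTC"));
    try {
      return df.parse(text).getTime();
    } catch (ParseException e) {
      throw new IllegalArgumentException("Bad expTs value: " + text, e);
    }
  }

  /** Returns false (drop the row) only if an expTs entry exists and is <= now. */
  public static boolean acceptRow(List<Entry> sortedRow, String expTsColFam,
      long nowMillis) {
    for (Entry e : sortedRow) { // look at every entry, not only the first
      if (e.family.equals(expTsColFam)) {
        return parseExpTs(e.value) > nowMillis;
      }
    }
    return true; // no expTs family yet, so the row has no expiration: keep it
  }

  public static void main(String[] args) {
    List<Entry> row = Arrays.asList(
        new Entry("creTs", "20130804171412445"),
        new Entry("data", "My fantastic data"),
        new Entry("expTs", "20131104171412445"));
    System.out.println(acceptRow(row, "expTs", System.currentTimeMillis()));
  }
}
```

In an actual RowFilter the same walk would presumably be a loop over the row
iterator using hasTop(), getTopKey(), getTopValue() and next().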


On Tue, Nov 5, 2013 at 8:48 PM, William Slacum <
wilhelm.von.cloud@accumulo.net> wrote:

> If an iterator is only set at scan time, then its logic will only be
> applied when a client scans the table. The data will persist through major
> and minor compaction and be visible if you scanned the RFile(s) backing the
> table. "Suppress" is the better word in this case. Would you please open a
> ticket pointing us where to update the documentation?
>
> It looks like you'd want to implement a RowFilter for your use case. It
> has the necessary hooks to avoid reading a whole row into memory and
> handling the logic of determining whether or not to write keys that occur
> before the column you're filtering on (at the cost of reading those keys
> twice).
>
>
> On Tue, Nov 5, 2013 at 6:20 PM, Terry P. <texpilot@gmail.com> wrote:
>
>> Greetings everyone,
>> I'm looking at the AgeOffFilter as a base from which to write a
>> server-side filter / iterator to purge rows when they have aged off based
>> on the value of a specific column in the row (expiry datetime <= now). So
>> this differs from the AgeOffFilter in that the criterion for removal is
>> from the same column in every row (not the Accumulo timestamp for an
>> individual entry), and we need to remove the entire row not just individual
>> entries. For example:
>>
>> Format: Key:CF:CQ:Value
>> abc:data:title:"My fantastic data"
>> abc:data:content:<bytedata>
>> abc:data:creTs:2013-08-04T17:14:12Z
>> abc:data:*expTs*:2013-11-04T17:14:12Z
>> ... 6-8 more columns of data per row ...
>>
>> where *expTs* is the column to determine if the entire row should be
>> removed based on whether its value is <= NOW.
>>
>> This task seemed easy enough as a client program (and it is really), but
>> a server-side iterator would be far more efficient than sending millions of
>> rowkeys across the network just to delete them (we'll be deleting more than
>> a million every hour).  But I'm struggling to get there.
>>
>> In looking at AgeOffFilter.java, is the "magic" in the AgeOffFilter class
>> that removes (deletes) an entry from a table the fact that the accept
>> method returns false, combined with the fact that the iterator would be set
>> to run at -majc or -minc time and it is the compaction code that actually
>> deletes the entry?  If set to run only at scan time, would AgeOffFilter
>> simply not return the rows during the scan, but not delete them?  The
>> wording in the iterator classes varies, some saying "remove" and others
>> "suppress", so it's not clear to me.
>>
>> If that's the case, then I think I know where to implement the logic. The
>> question is, how can I remove all the entries for the row once the accept
>> method has determined it meets the criteria?
>>
>> Or as Mike Drob mentioned in a prior post, will basing my class on the
>> RowFilter class instead of just Filter make things easier?  Or the
>> WholeRowIterator?  Just trying to find the simplest solution.
>>
>> Sorry for what may be obvious questions but I'm more of a DB Architect
>> who does some coding, and not a Java programmer by trade. With all of the
>> amazing things Accumulo does, honestly I was surprised when I couldn't find
>> a way to delete rows in the shell by criteria other than the rowkey!  I'm
>> more used to having a shell to 'delete from *table* where *column* <=
>> *value*'.
>>
>> But looking at it now, everyone's criteria for deletion will likely be
>> different given the flexibility of a key=>value store.  If our rowkey had
>> the date/timestamp as a prefix, I know an easy deletemany command in the
>> shell would do the trick -- but the nature of the data is such that
>> initially no expiration timestamp is set, and there is no means to update
>> the key from the client app when expiration timestamp finally gets set (too
>> much rework on that common tool I'm afraid).
>>
>> Thanks in advance.
>>
>
>
