MIME-Version: 1.0
Date: Wed, 6 Nov 2013 13:29:28 -0800
Subject: Re: How to remove entire row at the server side?
From: Billie Rinaldi
To: "user@accumulo.apache.org"
Content-Type: multipart/alternative; boundary=047d7b41ccf812d3c404ea88d869

--047d7b41ccf812d3c404ea88d869
Content-Type: text/plain; charset=ISO-8859-1

To use setiter in the shell, your iterator must implement OptionDescriber.
It has two methods, and something like the following should work for your
iterator. If you implement passing options to the iterator, you'll want to
change the null parameters to the constructor of IteratorOptions below, and
probably also do some validation in validateOptions.

  @Override
  public IteratorOptions describeOptions() {
    return new IteratorOptions("expTs", "Removes rows based on the column designated as the expiration timestamp column family", null, null);
  }

  @Override
  public boolean validateOptions(Map<String,String> options) {
    return true;
  }

On Wed, Nov 6, 2013 at 12:49 PM, Terry P. wrote:

> Eyes of an eagle Billie! com is correct, but after viewing
> "org.apache.accumulo" so many times, my brain was stuck on org and I goofed
> in my setiter syntax.
>
> With THAT corrected, here is the new error:
>
> root@meta> setiter -class com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p 20 -scan -t itertest
> 2013-11-06 14:46:28,280 [shell.Shell] ERROR:
> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
> not be initialized (Unable to load
> com.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
> org.apache.accumulo.core.iterators.OptionDescriber; configure with 'config'
> instead)
>
>
> On Wed, Nov 6, 2013 at 2:43 PM, Billie Rinaldi wrote:
>
>> Is there a typo in the package name? One place says "com" and the other
>> "org".
>>
>>
>> On Wed, Nov 6, 2013 at 12:37 PM, Terry P. wrote:
>>
>>> Hi William, many thanks for the explanation of scan time versus
>>> compaction time. I'll look through the classes again and note where the
>>> remove versus suppress wordings are used and open a ticket.
>>>
>>> As mentioned, I only dabble in Java, but regardless of that fact at this
>>> point I'm the one that has to get this done. I've cobbled together my first
>>> attempt, but I get the following error when I try to add it as a scan
>>> iterator for testing:
>>>
>>> root@meta> setiter -class org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p 20 -scan -t itertest
>>> 2013-11-06 14:06:34,914 [shell.Shell] ERROR:
>>> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
>>> not be initialized (Servers are unable to load
>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
>>> org.apache.accumulo.core.iterators.SortedKeyValueIterator)
>>>
>>> Here's my source. Note that the value stored in the expTs ColFam is in
>>> the format "yyyyMMddHHmmssS", which I convert to a long for a direct
>>> comparison to System.currentTimeMillis(). I only overrode the init and
>>> acceptRow methods, hoping the others would work as-is from the base class.
>>>
>>> One clarification: turns out expTs is the ColumnFamily, and the ingest
>>> app does not assign a ColumnQualifier for expTs. So to amend my prior table
>>> layout (including the datetime format):
>>>
>>>
>>> Format: Key:CF:CQ:Value
>>> abc:data:title:"My fantastic data"
>>> abc:data:content:<bytedata>
>>> abc:creTs::20130804171412445
>>> abc:*expTs*::20131104171412445
>>> ... 6-8 more columns of data per row ...
>>>
>>> where *expTs* is the ColumnFamily to determine if the entire row should
>>> be removed based on whether its value is <= NOW. If a row has not yet been
>>> assigned an expiration date, expTs will not be set and the ColumnFamily
>>> will not yet be present. Seems like an odd choice to use distinct Column
>>> Families, without Column Qualifiers, but that's how the ingest app was done.
>>>
>>> I greatly appreciate any advice you can provide.
>>>
>>> package com.esa.accumulo.iterators;
>>>
>>> import java.io.IOException;
>>> import java.text.ParseException;
>>> import java.text.SimpleDateFormat;
>>> import java.util.Date;
>>> import java.util.Map;
>>>
>>> import org.apache.accumulo.core.data.Key;
>>> import org.apache.accumulo.core.data.Value;
>>> import org.apache.accumulo.core.iterators.IteratorEnvironment;
>>> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
>>> import org.apache.accumulo.core.iterators.user.RowFilter;
>>>
>>> /**
>>>  * A filter that removes rows based on the column designated as the
>>>  * "expiration timestamp" column family.
>>>  *
>>>  * It removes the row if the value in the expirationTimestamp column is
>>>  * less than currentTime.
>>>  *
>>>  * TODO: The designation of the expirationTimestamp ColumnFamily and its
>>>  * DateFormat is
>>>  * set in the iterator options when the iterator is applied to the table.
>>>  * (For now it is hardcoded to match the format used in the Solr-Accumulo
>>>  * plugin)
>>>  */
>>> public class ExpirationTimestampPurgeFilter extends RowFilter {
>>>   private long currentTime;
>>>   // TODO: make accumuloDateFormat settable via Iterator Options
>>>   // Date Format for Expiration Timestamp ColumnFamily stored in Accumulo
>>>   private String expTsDateFormat = "yyyyMMddHHmmssS";
>>>   SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);
>>>
>>>   // TODO: make expTs settable via Iterator Options
>>>   // ColumnFamily containing Expiration Timestamp value (note ingest app
>>>   // did NOT assign a ColumnQualifier, only a ColumnFamily)
>>>   private String expTsColFam = "expTs";
>>>
>>>   @Override
>>>   public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
>>>       throws IOException {
>>>
>>>     if (rowIterator.getTopKey().getColumnFamily().toString().equals(expTsColFam)) {
>>>       Date expTsDate = null;
>>>       try {
>>>         expTsDate = df.parse(rowIterator.getTopValue().toString());
>>>         if (expTsDate.getTime() < currentTime)
>>>           return false;
>>>       } catch (ParseException e) {
>>>         // TODO Auto-generated catch block
>>>         e.printStackTrace();
>>>       }
>>>     }
>>>     return true;
>>>   }
>>>
>>>   @Override
>>>   public void init(SortedKeyValueIterator<Key, Value> source,
>>>       Map<String, String> options, IteratorEnvironment env) throws IOException {
>>>     super.init(source, options, env);
>>>     currentTime = System.currentTimeMillis();
>>>   }
>>>
>>> }
>>>
>>>
>>> On Tue, Nov 5, 2013 at 8:48 PM, William Slacum <wilhelm.von.cloud@accumulo.net> wrote:
>>>
>>>> If an iterator is only set at scan time, then its logic will only be
>>>> applied when a client scans the table. The data will persist through major
>>>> and minor compaction and be visible if you scanned the RFile(s) backing the
>>>> table. "Suppress" is the better word in this case. Would you please open a
>>>> ticket pointing us where to update the documentation?
>>>>
>>>> It looks like you'd want to implement a RowFilter for your use case.
>>>> It has the necessary hooks to avoid reading a whole row into memory and
>>>> handling the logic of determining whether or not to write keys that occur
>>>> before the column you're filtering on (at the cost of reading those keys
>>>> twice).
>>>>
>>>>
>>>> On Tue, Nov 5, 2013 at 6:20 PM, Terry P. wrote:
>>>>
>>>>> Greetings everyone,
>>>>> I'm looking at the AgeOffFilter as a base from which to write a
>>>>> server-side filter / iterator to purge rows when they have aged off based
>>>>> on the value of a specific column in the row (expiry datetime <= now). So
>>>>> this differs from the AgeOffFilter in that the criterion for removal is
>>>>> from the same column in every row (not the Accumulo timestamp for an
>>>>> individual entry), and we need to remove the entire row not just individual
>>>>> entries. For example:
>>>>>
>>>>> Format: Key:CF:CQ:Value
>>>>> abc:data:title:"My fantastic data"
>>>>> abc:data:content:<bytedata>
>>>>> abc:data:creTs:2013-08-04T17:14:12Z
>>>>> abc:data:*expTs*:2013-11-04T17:14:12Z
>>>>> ... 6-8 more columns of data per row ...
>>>>>
>>>>> where *expTs* is the column to determine if the entire row should be
>>>>> removed based on whether its value is <= NOW.
>>>>>
>>>>> This task seemed easy enough as a client program (and it is really),
>>>>> but a server-side iterator would be far more efficient than sending
>>>>> millions of rowkeys across the network just to delete them (we'll be
>>>>> deleting more than a million every hour). But I'm struggling to get there.
>>>>>
>>>>> In looking at AgeOffFilter.java, is the "magic" in the AgeOffFilter
>>>>> class that removes (deletes) an entry from a table the fact that the accept
>>>>> method returns false, combined with the fact that the iterator would be set
>>>>> to run at -majc or -minc time and it is the compaction code that actually
>>>>> deletes the entry? If set to run only at scan time, would AgeOffFilter
>>>>> simply not return the rows during the scan, but not delete them?
>>>>> The wording in the iterator classes varies, some saying "remove" others
>>>>> say "suppress", so it's not clear to me.
>>>>>
>>>>> If that's the case, then I think I know where to implement the logic.
>>>>> The question is, how can I remove all the entries for the row once the
>>>>> accept method has determined it meets the criteria?
>>>>>
>>>>> Or as Mike Drob mentioned in a prior post, will basing my class on the
>>>>> RowFilter class instead of just Filter make things easier? Or the
>>>>> WholeRowIterator? Just trying to find the simplest solution.
>>>>>
>>>>> Sorry for what may be obvious questions but I'm more of a DB Architect
>>>>> that does some coding, and not a Java programmer by trade. With all of the
>>>>> amazing things Accumulo does, honestly I was surprised when I couldn't find
>>>>> a way to delete rows in the shell by criteria other than the rowkey! I'm
>>>>> more used to having a shell to 'delete from *table* where *column* <=
>>>>> *value*'.
>>>>>
>>>>> But looking at it now, everyone's criteria for deletion will likely be
>>>>> different given the flexibility of a key=>value store. If our rowkey had
>>>>> the date/timestamp as a prefix, I know an easy deletemany command in the
>>>>> shell would do the trick -- but the nature of the data is such that
>>>>> initially no expiration timestamp is set, and there is no means to update
>>>>> the key from the client app when expiration timestamp finally gets set (too
>>>>> much rework on that common tool I'm afraid).
>>>>>
>>>>> Thanks in advance.
--047d7b41ccf812d3c404ea88d869--
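
For readers following the thread, here is a minimal standalone sketch of the two pieces discussed above: the kind of check validateOptions might do, and the per-row keep/drop decision. It is not Accumulo code. The row is modeled as a plain list of {columnFamily, value} pairs in sorted order rather than a SortedKeyValueIterator, the option names expTsColFam and expTsDateFormat are hypothetical, and reading the timestamps as UTC is an assumption. The sketch walks the whole row looking for the expTs family, on the assumption that the iterator handed to acceptRow starts at the row's first column (here creTs sorts before expTs), and it follows the "<= NOW" wording from the original question.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TimeZone;

final class ExpTsSketch {
    // Hypothetical option names; the thread does not define any.
    static final String OPT_COL_FAM = "expTsColFam";
    static final String OPT_DATE_FORMAT = "expTsDateFormat";

    // Defaults taken from the code quoted in the thread.
    static final String DEFAULT_COL_FAM = "expTs";
    static final String DEFAULT_DATE_FORMAT = "yyyyMMddHHmmssS";

    /** Stand-in for validateOptions: rejects option values that cannot be used. */
    static boolean validateOptions(Map<String, String> options) {
        String colFam = options.get(OPT_COL_FAM);
        if (colFam != null && colFam.isEmpty()) {
            return false;
        }
        String format = options.get(OPT_DATE_FORMAT);
        if (format != null) {
            try {
                new SimpleDateFormat(format); // throws on an illegal pattern
            } catch (IllegalArgumentException e) {
                return false;
            }
        }
        return true;
    }

    /** Parses an expTs value into epoch milliseconds. UTC is an assumption. */
    static long parseExpTs(String value) {
        // SimpleDateFormat is not thread-safe, so a fresh instance is used per call.
        SimpleDateFormat df = new SimpleDateFormat(DEFAULT_DATE_FORMAT);
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return df.parse(value).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad expTs value: " + value, e);
        }
    }

    /**
     * Decides whether a row is kept. Each entry is {columnFamily, value}.
     * Returns false (drop the row) only when an expTs entry exists and is <= now.
     */
    static boolean acceptRow(List<String[]> row, long now) {
        for (String[] entry : row) {
            if (DEFAULT_COL_FAM.equals(entry[0])) {
                try {
                    return parseExpTs(entry[1]) > now;
                } catch (IllegalArgumentException e) {
                    return true; // keep rows whose expTs cannot be parsed
                }
            }
        }
        return true; // no expTs assigned yet, so the row is kept
    }

    public static void main(String[] args) {
        List<String[]> row = Arrays.asList(new String[][] {
            {"creTs", "20130804171412445"},
            {"data", "My fantastic data"},
            {"expTs", "20131104171412445"},
        });
        System.out.println("options valid: " + validateOptions(new HashMap<String, String>()));
        System.out.println("row kept now:  " + acceptRow(row, System.currentTimeMillis()));
    }
}
```

In a real iterator the same decision would sit inside acceptRow, with the column family and date format read from the options map passed to init rather than from constants.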