Date: Wed, 6 Nov 2013 12:43:55 -0800
Subject: Re: How to remove entire row at the server side?
From: Billie Rinaldi
To: "user@accumulo.apache.org"

Is there a typo in the package name? One place says "com" and the other
"org".


On Wed, Nov 6, 2013 at 12:37 PM, Terry P. <texpilot@gmail.com> wrote:

> Hi William, many thanks for the explanation of scan time versus
> compaction time. I'll look through the classes again and note where the
> remove versus suppress wordings are used and open a ticket.
>
> As mentioned, I only dabble in Java, but regardless of that fact at this
> point I'm the one that has to get this done. I've cobbled together my
> first attempt, but I get the following error when I try to add it as a
> scan iterator for testing:
>
> root@meta> setiter -class org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p 20 -scan -t itertest
> 2013-11-06 14:06:34,914 [shell.Shell] ERROR:
> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
> not be initialized (Servers are unable to load
> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
> org.apache.accumulo.core.iterators.SortedKeyValueIterator)
>
> Here's my source. Note that the value stored in the expTs ColFam is in
> the format "yyyyMMddHHmmssS", which I convert to a long for a direct
> comparison to System.currentTimeMillis(). I only overrode the init and
> acceptRow methods, hoping the others would work as-is from the base
> class.
>
> One clarification: turns out expTs is the ColumnFamily, and the ingest
> app does not assign a ColumnQualifier for expTs. So to amend my prior
> table layout (including the datetime format):
>
> Format: Key:CF:CQ:Value
> abc:data:title:"My fantastic data"
> abc:data:content:<bytedata>
> abc:creTs::20130804171412445
> abc:*expTs*::20131104171412445
> ... 6-8 more columns of data per row ...
>
> where *expTs* is the ColumnFamily to determine if the entire row should
> be removed based on whether its value is <= NOW. If a row has not yet
> been assigned an expiration date, expTs will not be set and the
> ColumnFamily will not yet be present. Seems like an odd choice to use
> distinct Column Families, without Column Qualifiers, but that's how the
> ingest app was done.
>
> I greatly appreciate any advice you can provide.
>
> package com.esa.accumulo.iterators;
>
> import java.io.IOException;
> import java.text.ParseException;
> import java.text.SimpleDateFormat;
> import java.util.Date;
> import java.util.Map;
>
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.iterators.IteratorEnvironment;
> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
> import org.apache.accumulo.core.iterators.user.RowFilter;
>
> /**
>  * A filter that removes rows based on the column designated as the "expiration timestamp" column family.
>  *
>  * It removes the row if the value in the expirationTimestamp column is less than currentTime.
>  *
>  * TODO: The designation of the expirationTimestamp ColumnFamily and its DateFormat is
>  * set in the iterator options when the iterator is applied to the table. (For
>  * now it is hardcoded to match the format used in the Solr-Accumulo plugin)
>  */
> public class ExpirationTimestampPurgeFilter extends RowFilter {
>   private long currentTime;
>   // TODO: make accumuloDateFormat settable via Iterator Options
>   // Date Format for Expiration Timestamp ColumnFamily stored in Accumulo
>   private String expTsDateFormat = "yyyyMMddHHmmssS";
>   SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);
>
>   // TODO: make expTs settable via Iterator Options
>   // ColumnFamily containing Expiration Timestamp value (note ingest app
>   // did NOT assign a ColumnQualifier, only a ColumnFamily)
>   private String expTsColFam = "expTs";
>
>   @Override
>   public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
>       throws IOException {
>
>     if (rowIterator.getTopKey().getColumnFamily().toString().equals(expTsColFam)) {
>       Date expTsDate = null;
>       try {
>         expTsDate = df.parse(rowIterator.getTopValue().toString());
>         if (expTsDate.getTime() < currentTime)
>           return false;
>       } catch (ParseException e) {
>         // TODO Auto-generated catch block
>         e.printStackTrace();
>       }
>     }
>     return true;
>   }
>
>   @Override
>   public void init(SortedKeyValueIterator<Key, Value> source,
>       Map<String, String> options, IteratorEnvironment env) throws IOException {
>     super.init(source, options, env);
>     currentTime = System.currentTimeMillis();
>   }
>
> }
>
>
> On Tue, Nov 5, 2013 at 8:48 PM, William Slacum <
> wilhelm.von.cloud@accumulo.net> wrote:
>
>> If an iterator is only set at scan time, then its logic will only be
>> applied when a client scans the table. The data will persist through
>> major and minor compaction and be visible if you scanned the RFile(s)
>> backing the table. "Suppress" is the better word in this case. Would
>> you please open a ticket pointing us where to update the documentation?
>>
>> It looks like you'd want to implement a RowFilter for your use case. It
>> has the necessary hooks to avoid reading a whole row into memory and
>> handling the logic of determining whether or not to write keys that
>> occur before the column you're filtering on (at the cost of reading
>> those keys twice).
>>
>>
>> On Tue, Nov 5, 2013 at 6:20 PM, Terry P. <texpilot@gmail.com> wrote:
>>
>>> Greetings everyone,
>>> I'm looking at the AgeOffFilter as a base from which to write a
>>> server-side filter / iterator to purge rows when they have aged off
>>> based on the value of a specific column in the row (expiry datetime
>>> <= now). So this differs from the AgeOffFilter in that the criterion
>>> for removal is from the same column in every row (not the Accumulo
>>> timestamp for an individual entry), and we need to remove the entire
>>> row, not just individual entries. For example:
>>>
>>> Format: Key:CF:CQ:Value
>>> abc:data:title:"My fantastic data"
>>> abc:data:content:<bytedata>
>>> abc:data:creTs:2013-08-04T17:14:12Z
>>> abc:data:*expTs*:2013-11-04T17:14:12Z
>>> ... 6-8 more columns of data per row ...
>>>
>>> where *expTs* is the column to determine if the entire row should be
>>> removed based on whether its value is <= NOW.
>>>
>>> This task seemed easy enough as a client program (and it is really),
>>> but a server-side iterator would be far more efficient than sending
>>> millions of rowkeys across the network just to delete them (we'll be
>>> deleting more than a million every hour). But I'm struggling to get
>>> there.
>>>
>>> In looking at AgeOffFilter.java, is the "magic" in the AgeOffFilter
>>> class that removes (deletes) an entry from a table the fact that the
>>> accept method returns false, combined with the fact that the iterator
>>> would be set to run at -majc or -minc time and it is the compaction
>>> code that actually deletes the entry? If set to run only at scan
>>> time, would AgeOffFilter simply not return the rows during the scan,
>>> but not delete them? The wording in the iterator classes varies, some
>>> saying "remove" and others "suppress", so it's not clear to me.
>>>
>>> If that's the case, then I think I know where to implement the logic.
>>> The question is, how can I remove all the entries for the row once
>>> the accept method has determined it meets the criteria?
>>>
>>> Or as Mike Drob mentioned in a prior post, will basing my class on
>>> the RowFilter class instead of just Filter make things easier? Or the
>>> WholeRowIterator? Just trying to find the simplest solution.
>>>
>>> Sorry for what may be obvious questions but I'm more of a DB
>>> Architect that does some coding, and not a Java programmer by trade.
>>> With all of the amazing things Accumulo does, honestly I was
>>> surprised when I couldn't find a way to delete rows in the shell by
>>> criteria other than the rowkey! I'm more used to having a shell to
>>> 'delete from table where column <= value'.
>>>
>>> But looking at it now, everyone's criteria for deletion will likely
>>> be different given the flexibility of a key=>value store. If our
>>> rowkey had the date/timestamp as a prefix, I know an easy deletemany
>>> command in the shell would do the trick -- but the nature of the data
>>> is such that initially no expiration timestamp is set, and there is
>>> no means to update the key from the client app when expiration
>>> timestamp finally gets set (too much rework on that common tool I'm
>>> afraid).
>>>
>>> Thanks in advance.
>>>
>>
>>
>
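
Separate from the package-name question, the row-level decision the quoted filter is aiming for can be sketched in plain Java without any Accumulo dependency. This is only an illustration under stated assumptions: the class name ExpiryCheck, the helper names, the use of a list of (column family, value) pairs to stand in for a row, and the choice of UTC for parsing are all hypothetical and not taken from the thread. The column family "expTs", the pattern "yyyyMMddHHmmssS", and the "less than now" comparison are copied from the quoted source. The sketch looks at every entry in the row, since "creTs" and "data" sort before "expTs" and the expiration entry is therefore unlikely to be the first one a row iterator is positioned on.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TimeZone;

// Hypothetical stand-alone sketch; not part of Accumulo and not the poster's class.
class ExpiryCheck {
    // Column family and date pattern as given in the quoted filter.
    static final String EXP_TS_COL_FAM = "expTs";
    static final String EXP_TS_PATTERN = "yyyyMMddHHmmssS";

    /** Parses a stored expiration value into epoch milliseconds. UTC is an assumption. */
    static long parseExpiryMillis(String value) {
        // SimpleDateFormat is not thread-safe, so a fresh instance is created per call.
        SimpleDateFormat df = new SimpleDateFormat(EXP_TS_PATTERN);
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return df.parse(value).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable expiration value: " + value, e);
        }
    }

    /** Builds one (column family, value) pair; a stand-in for a key/value entry. */
    static Map.Entry<String, String> entry(String columnFamily, String value) {
        return new SimpleEntry<>(columnFamily, value);
    }

    /**
     * Returns false (drop the row) only when an expTs entry exists and its value
     * is earlier than nowMillis. Every entry is inspected, not just the first.
     */
    static boolean acceptRow(List<Map.Entry<String, String>> row, long nowMillis) {
        for (Map.Entry<String, String> e : row) {
            if (!EXP_TS_COL_FAM.equals(e.getKey())) {
                continue; // not the expiration column, keep looking
            }
            try {
                return parseExpiryMillis(e.getValue()) >= nowMillis;
            } catch (IllegalArgumentException ex) {
                return true; // keep rows whose expiration value cannot be parsed
            }
        }
        return true; // no expTs entry: the row has no expiration date yet
    }

    public static void main(String[] args) {
        // Entries listed in the order the column families would sort.
        List<Map.Entry<String, String>> row = Arrays.asList(
            entry("creTs", "20130804171412445"),
            entry("data", "My fantastic data"),
            entry("expTs", "20131104171412445"));
        System.out.println("keep row? " + acceptRow(row, System.currentTimeMillis()));
    }
}
```

Inside a real RowFilter the same idea would presumably become a loop over the row iterator passed to acceptRow, using hasTop(), getTopKey(), getTopValue() and next() from SortedKeyValueIterator, rather than a single check of the top key. Treat that mapping as a suggestion to verify against the RowFilter documentation, not as tested Accumulo code.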