Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
Received-SPF: pass (nike.apache.org: domain of billie.rinaldi@gmail.com
 designates 209.85.219.54 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAPnhrdssnRPB15dxqb17KOuBY++oP-rD2Sw7MqKk4onBdrhRNg@mail.gmail.com>
References: 
 <CAPnhrdszdM44OLKvN79toeXtjVf01JHOwnhOdnPYgLU10C9r+Q@mail.gmail.com>
	<CAMz+Dut_gntdOXzinBEwQphqc7OTNoutSgTVYKfwQ0-p8UGizw@mail.gmail.com>
	<CAPnhrdssnRPB15dxqb17KOuBY++oP-rD2Sw7MqKk4onBdrhRNg@mail.gmail.com>
Date: Wed, 6 Nov 2013 12:43:55 -0800
Message-ID: 
 <CAF1jEfAiYN0USAvQgCbnh4J=3FP5LqEL5WwG7hSwaPuLTF=oOA@mail.gmail.com>
Subject: Re: How to remove entire row at the server side?
From: Billie Rinaldi <billie.rinaldi@gmail.com>
To: "user@accumulo.apache.org" <user@accumulo.apache.org>
Content-Type: multipart/alternative; boundary=089e0160c35e2abfdc04ea8835c5

--089e0160c35e2abfdc04ea8835c5
Content-Type: text/plain; charset=ISO-8859-1

Is there a typo in the package name?  The source declares package
"com.esa.accumulo.iterators", but the setiter command uses
"org.esa.accumulo.iterators".
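
If so, that alone would explain the error: the servers look the class up by its fully qualified name, so the string passed to -class has to match the package declaration exactly. A minimal sketch of the effect, using JDK classes as stand-ins (ClassNameCheck and canLoad are made-up names for illustration, not part of Accumulo):

```java
public class ClassNameCheck {
  /** Returns true if a class with exactly this fully qualified name can be loaded. */
  public static boolean canLoad(String fullyQualifiedName) {
    try {
      Class.forName(fullyQualifiedName);
      return true;
    } catch (ClassNotFoundException e) {
      // Same simple class name, different package: the lookup fails.
      return false;
    }
  }

  public static void main(String[] args) {
    // A real JDK class, looked up under its actual package.
    System.out.println(canLoad("java.util.ArrayList"));
    // The same simple name under a package where no such class exists.
    System.out.println(canLoad("javax.util.ArrayList"));
  }
}
```

The same applies to the iterator: whichever of the two packages the compiled class actually lives in is the name setiter needs.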


On Wed, Nov 6, 2013 at 12:37 PM, Terry P. <texpilot@gmail.com> wrote:

> Hi William, many thanks for the explanation of scan time versus compaction
> time. I'll look through the classes again and note where the remove versus
> suppress wordings are used and open a ticket.
>
> As mentioned, I only dabble in Java, but regardless, at this point I'm the
> one who has to get this done. I've cobbled together my first attempt, but I
> get the following error when I try to add it as a scan iterator for testing:
>
> root@meta> setiter -class org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p 20 -scan -t itertest
> 2013-11-06 14:06:34,914 [shell.Shell] ERROR: org.apache.accumulo.core.util.shell.ShellCommandException: Command could not be initialized (Servers are unable to load org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type org.apache.accumulo.core.iterators.SortedKeyValueIterator)
>
> Here's my source.  Note that the value stored in the expTs ColFam is in
> the format "yyyyMMddHHmmssS", which I convert to a long for a direct
> comparison to System.currentTimeMillis(). I only overrode the init and
> acceptRow methods, hoping the others would work as-is from the base class.
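
On the timestamp conversion: a minimal sketch of parsing that format into epoch milliseconds, in plain Java. The UTC time zone is an assumption on my part (the message does not say which zone the ingest app writes), and ExpTsParse and toEpochMillis are hypothetical names:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class ExpTsParse {
  /** Parses a value such as "20131104171412445" into epoch milliseconds. */
  public static long toEpochMillis(String expTs) {
    // SimpleDateFormat is not thread-safe, so this sketch builds one per call.
    SimpleDateFormat df = new SimpleDateFormat("yyyyMMddHHmmssS");
    df.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: stored values are UTC
    df.setLenient(false); // reject out-of-range fields such as month 13
    try {
      return df.parse(expTs).getTime();
    } catch (ParseException e) {
      throw new IllegalArgumentException("Unparseable expTs value: " + expTs, e);
    }
  }

  public static void main(String[] args) {
    long expiresAt = toEpochMillis("20131104171412445");
    long now = System.currentTimeMillis();
    System.out.println("expired: " + (expiresAt <= now));
  }
}
```

Without an explicit zone, SimpleDateFormat falls back to the JVM default, so the same stored value could expire at different instants on differently configured servers.
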
>
> One clarification: turns out expTs is the ColumnFamily, and the ingest app
> does not assign a ColumnQualifier for expTs. So to amend my prior table
> layout (including the datetime format):
>
>
> Format: Key:CF:CQ:Value
> abc:data:title:"My fantastic data"
> abc:data:content:<bytedata>
> abc:creTs::20130804171412445
> abc:*expTs*::20131104171412445
> ... 6-8 more columns of data per row ...
>
> where *expTs* is the ColumnFamily to determine if the entire row should
> be removed based on whether its value is <= NOW.  If a row has not yet been
> assigned an expiration date, expTs will not be set and the ColumnFamily
> will not yet be present.  Seems like an odd choice to use distinct Column
> Families, without Column Qualifiers, but that's how the ingest app was done.
>
> I greatly appreciate any advice you can provide.
>
> package com.esa.accumulo.iterators;
>
> import java.io.IOException;
> import java.text.ParseException;
> import java.text.SimpleDateFormat;
> import java.util.Date;
> import java.util.Map;
>
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.iterators.IteratorEnvironment;
> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
> import org.apache.accumulo.core.iterators.user.RowFilter;
>
> /**
>  * A filter that removes rows based on the column designated as the
>  * "expiration timestamp" column family.
>  *
>  * It removes the row if the value in the expirationTimestamp column is
>  * less than currentTime.
>  *
>  * TODO: The designation of the expirationTimestamp ColumnFamily and its
>  * DateFormat is set in the iterator options when the iterator is applied
>  * to the table. (For now it is hardcoded to match the format used in the
>  * Solr-Accumulo plugin)
>  */
> public class ExpirationTimestampPurgeFilter extends RowFilter {
>   private long currentTime;
>   // TODO: make accumuloDateFormat settable via Iterator Options
>   // Date Format for Expiration Timestamp ColumnFamily stored in Accumulo
>   private String expTsDateFormat = "yyyyMMddHHmmssS";
>   SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);
>
>   // TODO: make expTs settable via Iterator Options
>   // ColumnFamily containing Expiration Timestamp value (note ingest app
>   // did NOT assign a ColumnQualifier, only a ColumnFamily)
>   private String expTsColFam = "expTs";
>
>   @Override
>   public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
>     throws IOException {
>
>     if (rowIterator.getTopKey().getColumnFamily().toString().equals(expTsColFam)) {
>       Date expTsDate = null;
>       try {
>         expTsDate = df.parse(rowIterator.getTopValue().toString());
>         if (expTsDate.getTime() < currentTime)
>           return false;
>       } catch (ParseException e) {
>         // TODO Auto-generated catch block
>         e.printStackTrace();
>       }
>     }
>     return true;
>   }
>
>   @Override
>   public void init(SortedKeyValueIterator<Key, Value> source,
>       Map<String, String> options, IteratorEnvironment env)
>       throws IOException {
>     super.init(source, options, env);
>     currentTime = System.currentTimeMillis();
>   }
>
> }
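
One more thing to watch once the class loads: acceptRow above only looks at the first key of the row. Accumulo sorts the keys within a row by column family, so with this layout creTs and data would normally come before expTs and the check would not fire. Inside a RowFilter, the fix would be to step through the row with rowIterator.hasTop() and rowIterator.next() until the expTs family turns up. The sketch below shows just that decision logic in plain Java so it runs without Accumulo; Entry, RowExpirySketch, and the UTC time zone are assumptions for illustration, and the comparison follows the stated rule (drop the row when expTs <= now).

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.List;
import java.util.TimeZone;

public class RowExpirySketch {
  /** Hypothetical stand-in for one Accumulo entry: column family plus value. */
  public static final class Entry {
    final String family;
    final String value;

    public Entry(String family, String value) {
      this.family = family;
      this.value = value;
    }
  }

  static final String EXP_TS_FAMILY = "expTs";
  static final String EXP_TS_FORMAT = "yyyyMMddHHmmssS";

  /** Returns false (drop the row) only if an expTs entry exists and is <= nowMillis. */
  public static boolean acceptRow(List<Entry> rowInSortedOrder, long nowMillis) {
    SimpleDateFormat df = new SimpleDateFormat(EXP_TS_FORMAT);
    df.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: stored values are UTC
    df.setLenient(false);
    // Walk every entry of the row instead of inspecting only the first one.
    for (Entry e : rowInSortedOrder) {
      if (!EXP_TS_FAMILY.equals(e.family)) {
        continue;
      }
      try {
        return df.parse(e.value).getTime() > nowMillis;
      } catch (ParseException ex) {
        return true; // keep rows whose timestamp cannot be parsed
      }
    }
    return true; // no expTs yet: the row has no expiration date, so keep it
  }

  public static void main(String[] args) {
    List<Entry> row = Arrays.asList(
        new Entry("creTs", "20130804171412445"),
        new Entry("data", "My fantastic data"),
        new Entry("expTs", "20131104171412445"));
    System.out.println("keep row: " + acceptRow(row, System.currentTimeMillis()));
  }
}
```

Keeping unparseable rows is a deliberate choice in this sketch: at compaction time a false return means the data is gone for good, so it seems safer to err on the side of keeping.
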
>
>
>
> On Tue, Nov 5, 2013 at 8:48 PM, William Slacum <wilhelm.von.cloud@accumulo.net> wrote:
>
>> If an iterator is only set at scan time, then its logic will only be
>> applied when a client scans the table. The data will persist through major
>> and minor compaction and be visible if you scanned the RFile(s) backing the
>> table. "Suppress" is the better word in this case. Would you please open a
>> ticket pointing us where to update the documentation?
>>
>> It looks like you'd want to implement a RowFilter for your use case. It
>> has the necessary hooks to avoid reading a whole row into memory, and it
>> handles the logic of determining whether or not to write keys that occur
>> before the column you're filtering on (at the cost of reading those keys
>> twice).
>>
>>
>> On Tue, Nov 5, 2013 at 6:20 PM, Terry P. <texpilot@gmail.com> wrote:
>>
>>> Greetings everyone,
>>> I'm looking at the AgeOffFilter as a base from which to write a
>>> server-side filter / iterator to purge rows when they have aged off based
>>> on the value of a specific column in the row (expiry datetime <= now). So
>>> this differs from the AgeOffFilter in that the criterion for removal is
>>> from the same column in every row (not the Accumulo timestamp for an
>>> individual entry), and we need to remove the entire row not just individual
>>> entries. For example:
>>>
>>> Format: Key:CF:CQ:Value
>>> abc:data:title:"My fantastic data"
>>> abc:data:content:<bytedata>
>>> abc:data:creTs:2013-08-04T17:14:12Z
>>> abc:data:*expTs*:2013-11-04T17:14:12Z
>>> ... 6-8 more columns of data per row ...
>>>
>>> where *expTs* is the column to determine if the entire row should be
>>> removed based on whether its value is <= NOW.
>>>
>>> This task seemed easy enough as a client program (and it is really), but
>>> a server-side iterator would be far more efficient than sending millions of
>>> rowkeys across the network just to delete them (we'll be deleting more than
>>> a million every hour).  But I'm struggling to get there.
>>>
>>> In looking at AgeOffFilter.java, is the "magic" in the AgeOffFilter
>>> class that removes (deletes) an entry from a table the fact that the accept
>>> method returns false, combined with the fact that the iterator would be set
>>> to run at -majc or -minc time and it is the compaction code that actually
>>> deletes the entry?  If set to run only at scan time, would AgeOffFilter
>>> simply not return the rows during the scan, but not delete them?  The
>>> wording in the iterator classes varies, some saying "remove", others
>>> "suppress", so it's not clear to me.
>>>
>>> If that's the case, then I think I know where to implement the logic.
>>> The question is, how can I remove all the entries for the row once the
>>> accept method has determined it meets the criteria?
>>>
>>> Or as Mike Drob mentioned in a prior post, will basing my class on the
>>> RowFilter class instead of just Filter make things easier?  Or the
>>> WholeRowIterator?  Just trying to find the simplest solution.
>>>
>>> Sorry for what may be obvious questions but I'm more of a DB Architect
>>> that does some coding, and not a Java programmer by trade. With all of the
>>> amazing things Accumulo does, honestly I was surprised when I couldn't find
>>> a way to delete rows in the shell by criteria other than the rowkey!  I'm
>>> more used to having a shell to 'delete from table where column <=
>>> value'.
>>>
>>> But looking at it now, everyone's criteria for deletion will likely be
>>> different given the flexibility of a key=>value store.  If our rowkey had
>>> the date/timestamp as a prefix, I know an easy deletemany command in the
>>> shell would do the trick -- but the nature of the data is such that
>>> initially no expiration timestamp is set, and there is no means to update
>>> the key from the client app when expiration timestamp finally gets set (too
>>> much rework on that common tool I'm afraid).
>>>
>>> Thanks in advance.
>>>
>>
>>
>

--089e0160c35e2abfdc04ea8835c5--