accumulo-user mailing list archives

From "Terry P." <texpi...@gmail.com>
Subject Re: How to remove entire row at the server side?
Date Fri, 08 Nov 2013 05:26:39 GMT
Hi Keith,
Given that what I need to filter on is only the expTs column family, would
it be faster to seek?  I don't know how to seek, but I also can't figure
out how to iterate inside the acceptRow method -- there's no scanner as I
normally use when reading and iterating over key/values.
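
For reference, no scanner should be needed inside acceptRow: the rowIterator argument is itself a SortedKeyValueIterator, so the row can be walked with hasTop(), getTopKey(), getTopValue() and next(). The sketch below models only that walk in plain Java so it runs without Accumulo on the classpath. The class name, the Map-based row, and the parseExpTs helper are assumptions made for illustration and are not Accumulo APIs; the date pattern and the "less than now" rule are taken from the filter source quoted further down.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.LinkedHashMap;
import java.util.Map;

class ExpTsRowCheck {
    // Pattern and column family name as used in the quoted filter source.
    static final String EXP_TS_FORMAT = "yyyyMMddHHmmssS";
    static final String EXP_TS_COL_FAM = "expTs";

    /** Parses an expTs value to epoch millis; returns null if it cannot be parsed. */
    static Long parseExpTs(String value) {
        try {
            // SimpleDateFormat is not thread-safe, so a fresh instance is used per call.
            return new SimpleDateFormat(EXP_TS_FORMAT).parse(value).getTime();
        } catch (ParseException e) {
            return null;
        }
    }

    /**
     * Plain-Java stand-in for acceptRow. 'row' maps column family to value in
     * iteration order (hypothetical model of the row, not an Accumulo type).
     * Returns false only when an expTs value earlier than 'now' is found.
     */
    static boolean acceptRow(Map<String, String> row, long now) {
        for (Map.Entry<String, String> entry : row.entrySet()) {
            if (!EXP_TS_COL_FAM.equals(entry.getKey())) {
                continue; // keep walking until the expTs column shows up
            }
            Long expTs = parseExpTs(entry.getValue());
            // An unparseable timestamp keeps the row rather than dropping it.
            return expTs == null || expTs >= now;
        }
        return true; // no expTs column in this row: keep it
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("creTs", "20130804171412445");
        row.put("data", "My fantastic data");
        row.put("expTs", "20131104171412445");
        System.out.println(acceptRow(row, System.currentTimeMillis()));
    }
}
```

The key difference from checking only the top key is that the loop does not assume expTs is the first column in the row.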

I read the notes on the seek method and it does seem like using it would be
more efficient, since the only criterion for this filter is the expTs column
family and thus only those RFiles would be opened, but I just can't figure
out where to start and my Googling hasn't yielded any examples yet.
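
As a rough picture of what seeking buys: SortedKeyValueIterator has a seek method that takes a range, a collection of column families, and an inclusive flag, and it repositions the iterator instead of stepping through every entry. The sketch below only models that idea with a sorted map from the Java standard library. The class name, the TreeMap-based row, and the seekExpTs helper are assumptions for illustration and are not Accumulo calls.

```java
import java.util.Map;
import java.util.TreeMap;

class ExpTsSeekSketch {
    static final String EXP_TS_COL_FAM = "expTs";

    /**
     * 'row' maps column family to value, sorted by column family
     * (hypothetical stand-in for the sorted entries of one row).
     * Returns the expTs value, or null if the row has no expTs column.
     */
    static String seekExpTs(TreeMap<String, String> row) {
        // ceilingEntry jumps to the first key >= "expTs", skipping
        // everything that sorts before it (creTs, data, ...).
        Map.Entry<String, String> top = row.ceilingEntry(EXP_TS_COL_FAM);
        // Landing on a later column, or on nothing, means expTs is absent.
        if (top == null || !EXP_TS_COL_FAM.equals(top.getKey())) {
            return null;
        }
        return top.getValue();
    }

    public static void main(String[] args) {
        TreeMap<String, String> row = new TreeMap<>();
        row.put("data", "My fantastic data");
        row.put("creTs", "20130804171412445");
        row.put("expTs", "20131104171412445");
        System.out.println(seekExpTs(row));
    }
}
```

With only 10-12 column families per row, a plain walk may be just as quick; the jump matters more when a lot of data sorts ahead of the target column.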



On Thu, Nov 7, 2013 at 3:16 PM, Keith Turner <keith@deenlo.com> wrote:

>
>
>
> On Thu, Nov 7, 2013 at 3:49 PM, Terry P. <texpilot@gmail.com> wrote:
>
>> Hi Keith,
>> No, expTs won't be the first actually -- that'll teach me to try things
>> with overly simplistic data!
>>
>
>>  There will be 10-12 column families for each row. I take it my simple
>> check for column family name isn't enough?
>>
>
> You can iterate until you see the column or seek to it.   If you expect
> there will always be a small amount of data before the column occurs, then iterate.
>
>
>>
>>
>> On Thursday, November 7, 2013, Keith Turner wrote:
>>
>>> Your accept row function assumes that expTs will be the first column in
>>> the row, is this always the case?
>>>
>>>
>>> On Wed, Nov 6, 2013 at 3:37 PM, Terry P. <texpilot@gmail.com> wrote:
>>>
>>> Hi William, many thanks for the explanation of scan time versus
>>> compaction time. I'll look through the classes again and note where the
>>> remove versus suppress wordings are used and open a ticket.
>>>
>>> As mentioned, I only dabble in java, but regardless of that fact at this
>>> point I'm the one that has to get this done. I've cobbled together my first
>>> attempt, but I get the following error where I try to add it as a scan
>>> iterator for testing:
>>>
>>> root@meta> setiter -class
>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter -n expTsFilter -p
>>> 20 -scan -t itertest
>>> 2013-11-06 14:06:34,914 [shell.Shell] ERROR:
>>> org.apache.accumulo.core.util.shell.ShellCommandException: Command could
>>> not be initialized (Servers are unable to load
>>> org.esa.accumulo.iterators.ExpirationTimestampPurgeFilter as type
>>> org.apache.accumulo.core.iterators.SortedKeyValueIterator)
>>>
>>> Here's my source.  Note that the value stored in the expTs ColFam is in
>>> the format "yyyyMMddHHmmssS", which I convert to a long for a direct
>>> comparison to System.currentTimeMillis(). I only overrode the init and
>>> acceptRow methods, hoping the others would work as-is from the base class.
>>>
>>> One clarification: turns out expTs is the ColumnFamily, and the ingest
>>> app does not assign a ColumnQualifier for expTs. So to amend my prior table
>>> layout (including the datetime format):
>>>
>>>
>>> Format: Key:CF:CQ:Value
>>> abc:data:title:"My fantastic data"
>>> abc:data:content:<bytedata>
>>> abc:creTs::20130804171412445
>>> abc:*expTs*::20131104171412445
>>> ... 6-8 more columns of data per row ...
>>>
>>> where *expTs* is the ColumnFamily to determine if the entire row should
>>> be removed based on whether its value is <= NOW.  If a row has not yet been
>>> assigned an expiration date, expTs will not be set and the ColumnFamily
>>> will not yet be present.  Seems like an odd choice to use distinct Column
>>> Families, without Column Qualifiers, but that's how the ingest app was done.
>>>
>>> I greatly appreciate any advice you can provide.
>>>
>>> package com.esa.accumulo.iterators;
>>>
>>> import java.io.IOException;
>>> import java.text.ParseException;
>>> import java.text.SimpleDateFormat;
>>> import java.util.Date;
>>> import java.util.Map;
>>>
>>> import org.apache.accumulo.core.data.Key;
>>> import org.apache.accumulo.core.data.Value;
>>> import org.apache.accumulo.core.iterators.IteratorEnvironment;
>>> import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
>>> import org.apache.accumulo.core.iterators.user.RowFilter;
>>>
>>> /**
>>>  * A filter that removes rows based on the column designated as the
>>>  * "expiration timestamp" column family.
>>>  *
>>>  * It removes the row if the value in the expirationTimestamp column is
>>>  * less than currentTime.
>>>  *
>>>  * TODO: The designation of the expirationTimestamp ColumnFamily and its
>>>  * DateFormat is set in the iterator options when the iterator is applied
>>>  * to the table. (For now it is hardcoded to match the format used in the
>>>  * Solr-Accumulo plugin)
>>>  */
>>> public class ExpirationTimestampPurgeFilter extends RowFilter {
>>>   private long currentTime;
>>>   // TODO: make accumuloDateFormat settable via Iterator Options
>>>   // Date Format for Expiration Timestamp ColumnFamily stored in Accumulo
>>>   private String expTsDateFormat = "yyyyMMddHHmmssS";
>>>   SimpleDateFormat df = new SimpleDateFormat(expTsDateFormat);
>>>
>>>   // TODO: make expTs settable via Iterator Options
>>>   // ColumnFamily containing Expiration Timestamp value (note ingest app
>>>   // did NOT assign a ColumnQualifier, only a ColumnFamily)
>>>   private String expTsColFam = "expTs";
>>>
>>>   @Override
>>>   public boolean acceptRow(SortedKeyValueIterator<Key, Value> rowIterator)
>>>       throws IOException {
>>>
>>>     if (rowIterator.getTopKey().getColumnFamily().toString().equals(expTsColFam)) {
>>>       Date expTsDate = null;
>>>       try {
>>>         expTsDate = df.parse(rowIterator.getTopValue().toString());
>>>         if (expTsDate.getTime() < currentTime)
>>>           return false;
>>>       } catch (ParseException e) {
>>>         // TODO Auto-generated catch block
>>>         e.printStackTrace();
>>>       }
>>>     }
>>>     return true;
>>>   }
>>>
>>>   @Override
>>>   public void init(SortedKeyValueIterator<Key, Value> source,
>>>       Map<String, Str
>>>
>>>
>
