accumulo-user mailing list archives

From Keith Turner <>
Subject Re: Knowing when an iterator is at the "last row/entry"
Date Wed, 08 Jan 2014 21:59:01 GMT
On Wed, Jan 8, 2014 at 3:53 PM, Terry P. <> wrote:

> Hi Keith,
> Well, not exactly (you're leaps ahead of me), but that is a great idea!
> Meaning, if we could do what you're suggesting, we wouldn't need to run a
> weekly maintenance job to actually perform a full major compaction in order
> to purge out expired data.  But for our iterator, we simply wanted to keep
> tabs on possible bad data (incorrect date formats to be exact) which would
> prevent the data from being purged properly by the iterator when a full
> major compaction was performed (which we intend to schedule weekly or as
> required).

Ok, now I understand what your problem is.  For the rows that you are
unable to make a purge decision about, you could emit a special column.
This would require a follow-on iterator that looks for bad rows and adds
this special column family.  Then you could scan the table for this column
family in order to find the bad rows.  This may or may not be quick.  If
you have fewer than 1000 unique column families in an RFile, then Accumulo
will keep track of them.  For a scan that calls fetchColumn(), this allows
it to quickly skip entire files that do not contain the column family.  You
would also have to make your query code ignore this special column family
in some way.
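A minimal sketch of that lookup, using plain Python tuples in place of Accumulo keys (the "purge_err" family name and the data layout are made-up assumptions, not anything from Accumulo itself):

```python
# Sketch: find "bad" rows by fetching a special marker column family.
# Entries are (row, column_family, column_qualifier, value) tuples in
# sorted order, standing in for Accumulo key/value pairs.
BAD_MARKER = "purge_err"  # hypothetical family a follow-on iterator emits

def fetch_column_family(entries, fam):
    """Mimic fetchColumnFamily(): keep only entries in the given family."""
    return [e for e in entries if e[1] == fam]

entries = [
    ("row1", "data", "date", "2014-01-08"),
    ("row2", "data", "date", "08/01/14"),   # unparseable date format
    ("row2", "purge_err", "", "bad date"),  # marker injected for row2
    ("row3", "data", "date", "2014-01-07"),
]

bad_rows = {e[0] for e in fetch_column_family(entries, BAD_MARKER)}
```

The query side would do the inverse: drop every `purge_err` entry so the marker family never leaks into normal results.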

If you have more than 1000 unique column families and you have large rows
(lots of columns), this approach may still be fast.  When fetching column
families, Accumulo will seek to the column family within each row.  So if
you have 1000 rows each having 1,000,000 columns and you fetch column
family x, then it will seek to column family x 1000 times.  If column
family x only exists in one row, it would still be found quickly in this
case.

For small rows and lots of unique column families you would end up scanning
most of the table in order to find your few bad rows.

In any case you could use a batch scanner to parallelize the scan.
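To illustrate, here is a rough model of that parallelization, loosely mirroring what a batch scanner does with multiple worker threads over disjoint ranges (the table contents, range split, and "purge_err" family are all illustrative assumptions):

```python
# Sketch: scan disjoint row ranges in parallel, collecting rows that
# carry the hypothetical "purge_err" marker family.
from concurrent.futures import ThreadPoolExecutor

TABLE = {
    "row1": {"data": "2014-01-08"},
    "row2": {"data": "08/01/14", "purge_err": "bad date"},
    "row3": {"data": "2014-01-07"},
    "row4": {"purge_err": "bad date"},
}

def scan_range(rows):
    """Scan one range, returning rows that carry the marker family."""
    return [r for r in rows if "purge_err" in TABLE[r]]

ranges = [["row1", "row2"], ["row3", "row4"]]  # disjoint, tablet-like ranges
with ThreadPoolExecutor(max_workers=2) as pool:
    bad = sorted(r for part in pool.map(scan_range, ranges) for r in part)
```

In real Accumulo code the equivalent would be a BatchScanner configured with multiple threads and a set of ranges, fetching only the marker column family.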

One note about this approach: iterators can introduce data, but they must
generate sorted data.  Introducing data could be tricky for scans, since
seeks would need to be handled properly.  However, for compactions (w/o
locality groups), it should be fairly straightforward.
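The sorted-output constraint is the key part: injected marker entries have to be merged into the source stream in key order, not appended at the end. A small sketch of that merge (entry layout and the "purge_err" family are illustrative assumptions):

```python
# Sketch: an iterator may introduce entries, but its output must stay
# sorted. Marker entries are merged into the sorted source stream so
# downstream iterators still see data in key order.
import heapq

source = [                                     # sorted (row, family, value)
    ("row1", "data", "2014-01-08"),
    ("row2", "data", "08/01/14"),
    ("row3", "data", "2014-01-07"),
]
markers = [("row2", "purge_err", "bad date")]  # hypothetical marker entries

merged = list(heapq.merge(source, markers))    # sorted by (row, family)
```

Because "data" sorts before "purge_err", the marker for row2 lands immediately after row2's data entry, which is exactly where a downstream iterator expects it.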

> I took a look at ACCUMULO-1266, and that sounds really useful -- but
> wouldn't that still rely on having a close() method (or something similar)
> in iterators, which is exactly what I have run into as lacking (and for
> which you opened ACCUMULO-1280)?

Yeah, it seems like it may rely on close().

> On Wed, Jan 8, 2014 at 11:00 AM, Keith Turner <> wrote:
>> On Wed, Jan 8, 2014 at 10:41 AM, Terry P. <> wrote:
>>> Hi Keith,
>>> The goal of the iterator is to purge data that has expired (or suppress
>>> it for scans). The goal of the log message is to bring to light any data
>>> format issues, as otherwise the "bad data" would NOT be purged by the
>>> iterator and hang around forever, which would be bad, so yes we would purge
>>> it with a special job. The iterator fires at both Full Major Compaction and
>>> at Scan time.
>> So you want to use the summary data from scans to know if you should kick
>> off a full major compaction?  In 1.6.0 compaction strategies were added
>> (ACCUMULO-1451).  If scans could provide information to these compaction
>> strategies, then that would lay the groundwork for ACCUMULO-1266 and what
>> you are trying to achieve.  I am not sure of the best way to do this.
>> Maybe when a scan iterator is closed it could update counters (counters
>> encourage small memory usage).  The compaction strategy could access the
>> counters and use them to make a decision about doing a full major
>> compaction.
>>> Good point on "How did the bad data get there?" -- it shouldn't based on
>>> how items are indexed and then inserted into Accumulo, but I wanted to
>>> check for it in case the individual that installs the iterator in Accumulo
>>> fat-fingers the date format, OR if someone changes it on the other side
>>> (the app that sends the data to Accumulo). The first one could happen
>>> easily, but the latter shouldn't happen. But as folks roll off programs and
>>> others maintain the code, anything can happen.
>>> Looks like ACCUMULO-1280 is exactly what I need! Maybe someday, but
>>> until then what I have for the iterator will do the job (and thanks again
>>> for your help on it!).
>>> Best regards,
>>> Terry
>>> On Wed, Jan 8, 2014 at 9:30 AM, Keith Turner <> wrote:
>>>> What is your goal?  It seems like you want to produce counts about bad
>>>> data suppressed at scan time.  What will you do with these counts?  Will
>>>> you ever purge the bad data?  How did the bad data get there?  If you are
>>>> not bulk importing the data, then maybe you could add constraints to the
>>>> table?
>>>>  On Mon, Jan 6, 2014 at 7:30 PM, Terry P. <> wrote:
>>>>> Greetings folks,
>>>>> I have an iterator that extends RowFilter and I have a case where I
>>>>> need to know when its defined date format doesn't match the format of
>>>>> data being scanned by the iterator.  I don't want to flood the tserver
>>>>> with an error per row (how horrid that would be), but instead keep a
>>>>> counter of the number of times that error occurs during a scan or major
>>>>> compaction.
>>>>> Trouble is, I don't see any way to know when an iterator is on the
>>>>> "last row" or "last entry" in its scan on a tabletserver, as if I could
>>>>> test for that, I could then dump my single log message with the count of
>>>>> date format parse errors for that scan/compaction.
>>>>> Anyone know a way to determine if an iterator is at the "last entry"
>>>>> or "last row" of its execution?
>>>> I do not think there is a good way to do this.  See ACCUMULO-1280.
>>>>> Many thanks in advance.
