accumulo-user mailing list archives

From Keith Turner <>
Subject Re: Knowing when an iterator is at the "last row/entry"
Date Wed, 08 Jan 2014 17:00:48 GMT
On Wed, Jan 8, 2014 at 10:41 AM, Terry P. <> wrote:

> Hi Keith,
> The goal of the iterator is to purge data that has expired (or suppress it
> for scans). The goal of the log message is to bring to light any data
> format issues, as otherwise the "bad data" would NOT be purged by the
> iterator and hang around forever, which would be bad, so yes we would purge
> it with a special job. The iterator fires at both Full Major Compaction and
> at Scan time.

So you want to use the summary data from scans to decide whether to kick
off a full major compaction?  In 1.6.0 compaction strategies were added
(ACCUMULO-1451).  If scans could provide information to these compaction
strategies, that would lay the groundwork for ACCUMULO-1266 and for what
you are trying to achieve.  I am not sure of the best way to do this.
Maybe when a scan iterator is closed it could update counters (counters
would probably also keep memory usage small).  The compaction strategy
could then access the counters and use them to decide whether to do a full
major compaction.
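
A rough sketch of the counter idea in plain Java, just to make it concrete.
This is not an existing Accumulo API: the class name, the method names, and
the threshold are all made up for illustration, and Accumulo currently has
no hook that calls into something like this when a scan iterator closes.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical shared counters; NOT part of any Accumulo API. */
public class ExpiredDataCounters {
    // LongAdder keeps contention low when many scans update concurrently.
    private final LongAdder entriesSeen = new LongAdder();
    private final LongAdder entriesSuppressed = new LongAdder();

    /** Assumed to be called once per scan, when the scan iterator is closed. */
    public void recordScan(long seen, long suppressed) {
        entriesSeen.add(seen);
        entriesSuppressed.add(suppressed);
    }

    /** Fraction of scanned entries that were suppressed (0.0 if none seen). */
    public double suppressedRatio() {
        long seen = entriesSeen.sum();
        return seen == 0 ? 0.0 : (double) entriesSuppressed.sum() / seen;
    }

    /** Decision a compaction strategy might make; the threshold is an assumption. */
    public boolean shouldFullMajorCompact(double threshold) {
        return suppressedRatio() >= threshold;
    }
}
```

Each scan would only hand over two longs when it finishes, so the memory
cost stays constant no matter how many rows were scanned.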

> Good point on "How did the bad data get there?" -- it shouldn't, based on
> how items are indexed and then inserted into Accumulo, but I wanted to
> check for it in case the individual that installs the iterator in Accumulo
> fat-fingers the date format, OR if someone changes it on the other side
> (the app that sends the data to Accumulo). The first one could happen
> easily, but the latter shouldn't happen. But as folks roll off programs and
> others maintain the code, anything can happen.

> Looks like ACCUMULO-1280 is exactly what I need! Maybe someday, but until
> then what I have for the iterator will do the job (and thanks again for
> your help on it!).
> Best regards,
> Terry
> On Wed, Jan 8, 2014 at 9:30 AM, Keith Turner <> wrote:
>> What is your goal?  It seems like you want to produce counts about bad
>> data suppressed at scan time.  What will you do with these counts?  Will
>> you ever purge the bad data?  How did the bad data get there?  If you are
>> not bulk importing the data, then maybe you could add constraints to the
>> table?
>> On Mon, Jan 6, 2014 at 7:30 PM, Terry P. <> wrote:
>>> Greetings folks,
>>> I have an iterator that extends RowFilter and I have a case where I need
>>> to know when its defined date format doesn't match the format of the data
>>> being scanned by the iterator.  I don't want to flood the tserver log with
>>> an error per row (how horrid that would be), but instead keep a counter of
>>> the number of times that error occurs during a scan or major compaction.
>>> Trouble is, I don't see any way to know when an iterator is on the "last
>>> row" or "last entry" in its scan on a tabletserver, as if I could test for
>>> that, I could then dump my single log message with the count of date format
>>> parse errors for that scan/compaction.
>>> Anyone know a way to determine if an iterator is at the "last entry" or
>>> "last row" of its execution?
>> I do not think there is a good way to do this.  ACCUMULO-1280
>>> Many thanks in advance.
