accumulo-user mailing list archives

From Keith Turner <>
Subject Re: Filter storing state
Date Thu, 03 Jan 2013 23:10:21 GMT
On Thu, Jan 3, 2013 at 6:08 PM, Corey Nolet <> wrote:
> That's funny you bring that up, because I was JUST discussing this as a possibility with
> a coworker. Compaction is really the phase that I'm concerned with, as the API for loading
> the data from the TopN currently only allows you to load the last N keys/values for a single
> index at a time.
> Can I guarantee that compaction will pass each row through a single filter?

Yes and no. The same iterator instance is used for an entire
compaction and is only inited and seeked once. However, sometimes
compactions only process a subset of a tablet's files. Therefore you
cannot guarantee you will see all columns in a row; you may only see
a subset. Also, if you have locality groups enabled, each locality
group is compacted separately.
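
To make the failure mode in this thread concrete, here is a minimal,
self-contained Java sketch. It does not use the Accumulo API: UuidFilter,
run, sampleEntries, and kvOnly are hypothetical names, the buffer is counted
in returned entries rather than bytes, and the index value is simplified to
just the uuid. It only illustrates how a filter that keeps a set of uuids
loses that state when init() is called again partway through the data, or
when it never sees the index entries at all.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StatefulFilterSketch {
    // Hypothetical stand-in for a filter that remembers uuids seen in index entries.
    static class UuidFilter {
        private Set<String> indexedUuids = new HashSet<>();

        // Mirrors re-initialization: all state collected so far is discarded.
        void init() {
            indexedUuids = new HashSet<>();
        }

        // family is the column family; value is the cell value.
        boolean accept(String family, String value) {
            if (family.equals("\u0000index")) {
                indexedUuids.add(value); // simplified: the value holds only the uuid
                return true;
            }
            // Key/value entry: the column family is the uuid.
            return indexedUuids.contains(family);
        }
    }

    // Feeds entries through the filter, calling init() again each time
    // bufferSize entries have been returned (a simplification of a scan buffer).
    static List<String> run(List<String[]> entries, int bufferSize) {
        UuidFilter filter = new UuidFilter();
        filter.init();
        List<String> kept = new ArrayList<>();
        int inBuffer = 0;
        for (String[] e : entries) {
            if (inBuffer == bufferSize) {
                filter.init();
                inBuffer = 0;
            }
            if (filter.accept(e[0], e[1])) {
                kept.add(e[0] + " -> " + e[1]);
                inBuffer++;
            }
        }
        return kept;
    }

    // Index entries sort first because "\u0000index" sorts before any uuid.
    static List<String[]> sampleEntries() {
        List<String[]> entries = new ArrayList<>();
        entries.add(new String[] {"\u0000index", "u1"});
        entries.add(new String[] {"\u0000index", "u2"});
        entries.addAll(kvOnly());
        return entries;
    }

    // Only the key/value entries, as if the index entries lived in other files.
    static List<String[]> kvOnly() {
        List<String[]> entries = new ArrayList<>();
        entries.add(new String[] {"u1", "key\u0000value"});
        entries.add(new String[] {"u2", "key\u0000value"});
        entries.add(new String[] {"u3", "key\u0000value"}); // u3 is not in the index
        return entries;
    }

    public static void main(String[] args) {
        System.out.println("large buffer kept: " + run(sampleEntries(), 100).size());
        System.out.println("small buffer kept: " + run(sampleEntries(), 2).size());
        System.out.println("no index seen kept: " + run(kvOnly(), 100).size());
    }
}
```

With a large buffer the sketch keeps both index entries plus the u1 and u2
key/values. With a buffer of two entries, init() runs right after the index
entries are returned, so every key/value entry is rejected. The third call
mimics a compaction over files that happen to hold no index entries: the
filter would drop everything, which is why relying on this state at
compaction time looks risky.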

> On Jan 3, 2013, at 5:54 PM, Keith Turner wrote:
>> Data is read from the iterators into a buffer.  When the buffer fills
>> up, the data is sent to the client and the iterators are reinitialized
>> to fill up the next buffer.
>> The default buffer size was changed from 50M to 1M at some point.
>> This is configured via the property table.scan.max.memory
>> The lower buffer size will cause the iterators to be reinitialized more
>> frequently.  Maybe you are seeing this.
>> Keith
>> On Thu, Jan 3, 2013 at 5:41 PM, Corey Nolet <> wrote:
>>> Hey Guys,
>>> In "Accumulo 1.3.5", I wrote a "Top N" table structure, services and a
>>> FilteringIterator that would allow us to drop in several keys/values
>>> associated with a UUID (similar to a document id). The UUID was further
>>> associated with an "index" (or type). The purpose of the TopN table was to
>>> keep the keys/values separated so that they could still be queried back with
>>> cell-level tagging, but when I performed a query for an index, I would get
>>> the last N UUIDs and further be able to query the keys/values for each of
>>> those UUIDs.
>>> This problem seemed simple to solve in Accumulo 1.3.5, as I was able to
>>> provide 2 FilteringIterators for compaction time to perform data cleanup of
>>> the table so that any keys/values kept around were guaranteed to be inside
>>> of the range of those keys being managed by the versioning iterator.
>>> Just to recap, I have the following table structure. I also hash the
>>> keys/values and run a filter before the versioning iterator to clean up any
>>> duplicates. There are two types of columns: index & key/value.
>>> Index:
>>> R: index (or "type" of data)
>>> F: '\x00index'
>>> Q: empty
>>> V: uuid\x00hashOfKeys&Values
>>> Key/Value:
>>> R: index (or "type" of data)
>>> F: uuid
>>> Q: key\x00value
>>> V: empty
>>> The filtering iterator that makes sure any key/value rows are in the index
>>> manages a hashset internally. The index rows are purposefully indexed before
>>> the key/value rows so that the filter can build up the hashset containing
>>> those uuids in the index. As the filter iterates into the key/value rows, it
>>> will return true only if the uuid of the key/value exists inside of the
>>> hashset containing the uuids in the index. This worked with older versions
>>> of Accumulo, but I'm now getting a weird artifact where init() is called on
>>> my Filter in the middle of iterating through an index row.
>>> More specifically, the Filter will iterate through the index rows of a
>>> specific "index" and build up a hashset, then init() will be called which
>>> wipes away the hashset of uuids, then the filter goes on to iterate through
>>> the key/value rows. Keep in mind, we are talking about maybe 400k entries,
>>> not enough to have more than 1 tablet.
>>> Any idea why this may have worked on 1.3.5 but doesn't work any longer? I
>>> know it has got to be a huge no-no to be storing state inside of a filter,
>>> but I haven't had any issues until trying to update my code for the new
>>> version. If I'm doing this completely wrong, any ideas on how to make this
>>> better?
>>> Thanks!
>>> --
>>> Corey Nolet
>>> Senior Software Engineer
>>> TexelTek, inc.
>>> [Office] 301.880.7123
>>> [Cell] 410-903-2110
