accumulo-user mailing list archives

From Corey Nolet <cno...@texeltek.com>
Subject Re: Filter storing state
Date Thu, 03 Jan 2013 23:04:58 GMT
John thanks for the quick response!

Crazily enough, I'm not doing much differently than the VersioningIterator, which stores
the max number of versions that it should be returning, right? And that's a scan-time iterator
(as well as majc/minc).

I am testing it as a scan time iterator (set on the table but using accumulo shell to scan).
Perhaps I should force a couple of compactions and see what's left afterwards.

On Jan 3, 2013, at 5:53 PM, John Vines wrote:

> Are you testing this in scan time or via actual minor/major compactions? I know at scan
time, there is no guarantee that the iterator remains intact through the entire scan, and
it instead may be reconstructed, causing state to be lost. I don't think this is the case
for compaction time iterators, but I'm not positive.
> 
> 
> On Thu, Jan 3, 2013 at 5:41 PM, Corey Nolet <cnolet@texeltek.com> wrote:
> Hey Guys,
> 
> In "Accumulo 1.3.5", I wrote a "Top N" table structure, services and a FilteringIterator
that would allow us to drop in several keys/values associated with a UUID (similar to a document
id). The UUID was further associated with an "index" (or type). The purpose of the TopN table
was to keep the keys/values separated so that they could still be queried back with cell-level
tagging, but when I performed a query for an index, I would get the last N UUIDs and further
be able to query the keys/values for each of those UUIDs.
> 
> This problem seemed simple to solve in Accumulo 1.3.5, as I was able to provide 2 FilteringIterators
for compaction time to perform data cleanup of the table so that any keys/values kept around
were guaranteed to be inside of the range of those keys being managed by the versioning iterator.
> 
> Just to recap, I have the following table structure. I also hash the keys/values and
run a filter before the versioning iterator to clean up any duplicates. There are two types
of columns: index & key/value.
> 
> 
> Index: 
> 
> R: index (or "type" of data)
> F: '\x00index'
> Q: empty
> V: uuid\x00hashOfKeys&Values
> 
> 
> Key/Value:
> 
> R: index (or "type" of data)
> F: uuid
> Q: key\x00value
> V: empty
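
The layout above can be sketched in plain Java to show why the index entry sorts ahead of the key/value entries within a row. This is only an illustration: the `Entry` class, the helper names, and the sample values are hypothetical, timestamps and visibilities are left out, and Java string ordering stands in for Accumulo's byte-wise key ordering (the two agree for ASCII data).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TopNLayoutDemo {
    /** Simplified stand-in for a table entry: row, family, qualifier, value. */
    static final class Entry {
        final String row, family, qualifier, value;

        Entry(String row, String family, String qualifier, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    // Approximates the sort order: row first, then family, then qualifier.
    static final Comparator<Entry> ORDER = Comparator
            .comparing((Entry e) -> e.row)
            .thenComparing(e -> e.family)
            .thenComparing(e -> e.qualifier);

    // Index column: F = '\x00index', Q = empty, V = uuid\x00hash.
    static Entry indexEntry(String index, String uuid, String hash) {
        return new Entry(index, "\u0000index", "", uuid + "\u0000" + hash);
    }

    // Key/value column: F = uuid, Q = key\x00value, V = empty.
    static Entry keyValueEntry(String index, String uuid, String key, String value) {
        return new Entry(index, uuid, key + "\u0000" + value, "");
    }

    // Returns a sorted copy so the caller's list is left untouched.
    static List<Entry> sorted(List<Entry> entries) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        entries.add(keyValueEntry("logs", "uuid-1", "host", "a"));
        entries.add(indexEntry("logs", "uuid-1", "abc123"));
        for (Entry e : sorted(entries)) {
            // Print the null byte visibly so the ordering is easy to read.
            System.out.println(e.row + " " + e.family.replace("\u0000", "\\x00"));
        }
    }
}
```

Because the family of the index column starts with a null byte, it sorts before any uuid family in the same row, which is what lets a filter see the index before the key/value columns.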
> 
> 
> The filtering iterator that makes sure any key/value rows are in the index manages a
hashset internally. The index rows are purposefully sorted before the key/value rows so that
the filter can build up the hashset containing those uuids in the index. As the filter iterates
into the key/value rows, it will return true only if the uuid of the key/value exists inside
of the hashset containing the uuids in the index. This worked with older versions of Accumulo,
but I'm now getting a weird artifact where init() is called on my Filter in the middle of
iterating through an index row.
> 
> More specifically, the Filter will iterate through the index rows of a specific "index"
and build up a hashset, then init() will be called, which wipes away the hashset of uuids,
then the filter goes on to iterate through the key/value rows. Keep in mind, we are talking
about maybe 400k entries, not enough to have more than 1 tablet.
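
The symptom described above can be reproduced in a small plain-Java simulation. It does not use Accumulo's iterator API; `UuidFilter`, `run`, and the sample entries are hypothetical stand-ins, and the only assumption is that `init()` resets the filter's in-memory set, as described in the message.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StatefulFilterDemo {
    static final String INDEX_FAMILY = "\u0000index";

    /** Hypothetical stand-in for the stateful filter; not Accumulo's Filter class. */
    static final class UuidFilter {
        private Set<String> seenUuids = new HashSet<>();

        // Re-initialization discards every uuid collected so far.
        void init() {
            seenUuids = new HashSet<>();
        }

        boolean accept(String family, String value) {
            if (family.equals(INDEX_FAMILY)) {
                // Index values look like uuid\x00hash; remember the uuid part.
                int sep = value.indexOf('\u0000');
                seenUuids.add(sep < 0 ? value : value.substring(0, sep));
                return true;
            }
            // Key/value columns use the uuid as the family.
            return seenUuids.contains(family);
        }
    }

    /**
     * Feeds entries through one filter and returns how many were accepted.
     * If reinitAt >= 0, init() is called again just before that position.
     */
    static int run(List<String[]> entries, int reinitAt) {
        UuidFilter filter = new UuidFilter();
        filter.init();
        int accepted = 0;
        for (int i = 0; i < entries.size(); i++) {
            if (i == reinitAt) {
                filter.init();
            }
            String[] e = entries.get(i);
            if (filter.accept(e[0], e[1])) {
                accepted++;
            }
        }
        return accepted;
    }

    // Two index entries followed by one key/value entry per uuid: {family, value}.
    static List<String[]> sample() {
        List<String[]> entries = new ArrayList<>();
        entries.add(new String[] {INDEX_FAMILY, "uuid-1\u0000hashA"});
        entries.add(new String[] {INDEX_FAMILY, "uuid-2\u0000hashB"});
        entries.add(new String[] {"uuid-1", ""});
        entries.add(new String[] {"uuid-2", ""});
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(run(sample(), -1)); // no re-init: prints 4
        System.out.println(run(sample(), 2));  // re-init after the index: prints 2
    }
}
```

With no re-initialization all four entries pass; once `init()` runs after the index entries have been read, the key/value entries are rejected because the set is empty again.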
> 
> Any idea why this may have worked on 1.3.5 but doesn't work any longer? I know it has
got to be a huge no-no to be storing state inside of a filter, but I haven't had any issues
until trying to update my code for the new version. If I'm doing this completely wrong, any
ideas on how to make this better?
> 
> 
> Thanks!
> 
> 
> -- 
> Corey Nolet
> Senior Software Engineer
> TexelTek, inc.
> [Office] 301.880.7123
> [Cell] 410-903-2110
> 

