accumulo-dev mailing list archives

From Billie J Rinaldi <>
Subject Re: Time based locality groups
Date Thu, 08 Mar 2012 16:42:09 GMT
On Thursday, March 8, 2012 7:01:49 AM, "Adam Fuchs" <> wrote:
> Yes, yes, yes, this is going to be a very useful feature set! (I told
> Andie all about it and she agreed whole-heartedly)
>
> I think that step one needs to be figuring out how to expose this in
> the API, and the iterator interface is the place to start. Once we
> have defined an abstraction layer, we can experiment with lots of
> different implementations at the RFile layer. If we are going to
> broadly extend these locality group-type filtering optimizations, it
> might make sense to drop the specialization for column family
> filtering that is part of the SortedKeyValueIterator seek method.
> Then we could support column family filtering, timestamp filtering,
> cell-level security filtering, etc. as separate iterators. The
> specialization for column family filtering is our current mechanism
> for optimizing that operation in the RFile, but we could be a little
> smarter about how we do this.
>
> What I'm suggesting is that when we construct an iterator tree we look
> for iterators on top of the RFile reader that we can collapse and
> implement as part of the RFile reader. So, if a column family
> filtering iterator is on top of the RFile then we can grab its set of
> column families and replace it with the filtered RFile reader. If we
> add a little knowledge about commutativity of iterators then we can
> even collapse filters that are not directly on top of the RFile reader
> (like there might be a merging iterator between the RFile reader and
> the column family filtering iterator). One way we could implement
> this is by changing the factory method that generates iterators. When
> this method calls the init method on a newly constructed iterator it
> can instead push that iterator down through the tree and return the
> source iterator instead. We might be able to specialize the iterator
> environment to signal the optimization and avoid any changes to the
> here.
>
> Once we get to the point of optimizing the RFile, I think what we
> might find is that the RFile entries are naturally grouped by time
> into blocks in many cases. A simple timestamp-based block filter might
> be optimal in these cases. This is what I was talking about with
> introducing extra features (timestamp ranges, etc) into the RFile
> index. I think it also makes sense to include some aggregate
> cell-level security markings here.
>
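
To make the timestamp-based block filter idea concrete, here is a minimal
sketch. It assumes each index entry records the minimum and maximum timestamp
of the entries in its block. The class and method names
(TimestampBlockFilterSketch, BlockIndexEntry, blocksToRead) are hypothetical
and are not part of the RFile code; this only illustrates how a timestamp
range in the index could let a reader skip whole blocks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types are Accumulo or RFile classes.
final class TimestampBlockFilterSketch {

    /** Assumed index entry: a block id plus the min and max timestamp of its entries. */
    static final class BlockIndexEntry {
        final int blockId;
        final long minTimestamp;
        final long maxTimestamp;

        BlockIndexEntry(int blockId, long minTimestamp, long maxTimestamp) {
            this.blockId = blockId;
            this.minTimestamp = minTimestamp;
            this.maxTimestamp = maxTimestamp;
        }

        /** True if [minTimestamp, maxTimestamp] intersects the inclusive range [start, end]. */
        boolean overlaps(long start, long end) {
            return minTimestamp <= end && maxTimestamp >= start;
        }
    }

    /** Returns ids of blocks that may hold entries in [start, end]; all others can be skipped unread. */
    static List<Integer> blocksToRead(List<BlockIndexEntry> index, long start, long end) {
        List<Integer> result = new ArrayList<>();
        for (BlockIndexEntry entry : index) {
            if (entry.overlaps(start, end)) {
                result.add(entry.blockId);
            }
        }
        return result;
    }
}
```

If entries really are grouped by time, most blocks have narrow timestamp
ranges and a query over a short window touches few of them. A block with a
wide range still has to be read and filtered entry by entry.
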
> One other thing to think about: I like the simpler iterator interface,
> but
> there are some implications to modifying the column family filter set
> during a query that might be tricky. Does anybody change the column
> family
> set mid-query now, anyway? Is that something we would want to support
> for
> timestamps or other filters?

There are iterators that change the column family filter set, so I'm wary of automatically
deciding which iterators can be pulled down into the file.
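
One way to picture a more conservative version of the collapse is to make it
opt-in, so that an iterator that changes its filter set mid-query is never
pulled down. The sketch below is an assumption-laden illustration: Node,
FileReader, FamilyFilter, and the fixedForQuery flag are hypothetical
stand-ins and not Accumulo's SortedKeyValueIterator API.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: these types are illustrative, not Accumulo classes.
final class FilterPushdownSketch {

    /** Minimal stand-in for a node in an iterator tree. */
    interface Node {
        String describe();
    }

    /** Stand-in for an RFile reader, optionally restricted to a set of column families. */
    static final class FileReader implements Node {
        final Set<String> families; // null means "no restriction"

        FileReader(Set<String> families) {
            this.families = families;
        }

        public String describe() {
            return families == null ? "reader(all)" : "reader(" + new TreeSet<>(families) + ")";
        }
    }

    /** Stand-in for a column family filtering iterator sitting on top of a source. */
    static final class FamilyFilter implements Node {
        final Node source;
        final Set<String> families;

        FamilyFilter(Node source, Set<String> families) {
            this.source = source;
            this.families = families;
        }

        public String describe() {
            return "filter(" + new TreeSet<>(families) + ") -> " + source.describe();
        }
    }

    /**
     * Stacks a column family filter on a source. The filter is collapsed into the reader
     * only if it sits directly on an unrestricted reader AND the caller has declared
     * (assumed opt-in flag) that the family set will not change during the query.
     */
    static Node stack(Node source, Set<String> families, boolean fixedForQuery) {
        if (fixedForQuery && source instanceof FileReader && ((FileReader) source).families == null) {
            return new FileReader(new HashSet<>(families));
        }
        return new FamilyFilter(source, families);
    }
}
```

With fixedForQuery set to false the tree keeps a separate filter node, which
is the safe default for iterators that modify the column family set while a
query is running.
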

