accumulo-dev mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: Scan-time iterators returning out-of-order rows
Date Thu, 02 Apr 2015 07:04:03 GMT
So in your example, you actually did return them in order (lexically, not
numerically), but I grok the idea that they might not be.

The problem is that your transformation promotes a portion of the cq to the
cf. That's fine if what your iterator returns comes only from
a single cf (say, the 'data' cf). But otherwise, you could get duplicates
or out of order results, which can mess up the client's expectations when
retrieving batches from the servers. It could work in some limited cases,
but I'd avoid it.
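
A minimal sketch of the ordering concern, in plain Python rather than
against the Accumulo API. It assumes keys compare as (row, cf, cq) tuples
of strings, which approximates Accumulo's lexicographic key order. The
names promoted_keys and is_sorted, the integer index values, and the
"swapped" index are hypothetical and exist only for illustration.

```python
# Sketch only: Python tuples stand in for Accumulo keys, and Python's
# string comparison stands in for Accumulo's byte-wise key ordering.

def promoted_keys(index_entries, prop, limit, wanted, row="p0"):
    """Walk index entries in sorted order; for each match, emit a
    (row, cf, cq) key with the document ID promoted into the cf."""
    out = []
    for name, value, doc_id in sorted(index_entries):
        if name == prop and value < limit:
            out.append((row, doc_id, wanted))
    return out


def is_sorted(keys):
    """True if every key is <= the key that follows it."""
    return all(a <= b for a, b in zip(keys, keys[1:]))


# Index entries from the thread: (property, value, document ID).
thread_index = [("prop_a", 7, "r15"), ("prop_a", 8, "r8"), ("prop_a", 9, "r19")]

# Hypothetical variant (not from the thread): the document IDs are swapped.
swapped_index = [("prop_a", 7, "r8"), ("prop_a", 8, "r15"), ("prop_a", 9, "r19")]

in_order = promoted_keys(thread_index, "prop_a", 9, "prop_b")
out_of_order = promoted_keys(swapped_index, "prop_a", 9, "prop_b")

print(in_order, is_sorted(in_order))
print(out_of_order, is_sorted(out_of_order))
```

With the thread's data the emitted keys happen to be sorted, because "r15"
sorts before "r8" lexically. With the swapped data the same walk emits
"r8" before "r15", which is the out-of-order case.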

Instead, why not preserve order by preserving the existing schema, and just
ignore the unused cf in the client?
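
One possible reading of that suggestion, sketched in plain Python and not
against the Accumulo API. It assumes the matching document IDs have already
been collected from the index, and that the data entries keep their stored
(row, cf, cq) keys. The names scan_preserving_schema and client_view are
hypothetical, and the tuple-based cq is a stand-in for whatever encoding the
real table uses.

```python
# Sketch only: a dict of (row, cf, cq) -> value stands in for the table,
# with cq modeled as a (document ID, property) tuple.
data = {
    ("p0", "d", ("r8", "prop_a")): "8",
    ("p0", "d", ("r8", "prop_b")): "hello, world",
    ("p0", "d", ("r15", "prop_a")): "7",
    ("p0", "d", ("r15", "prop_b")): "just testing",
    ("p0", "d", ("r19", "prop_a")): "9",
    ("p0", "d", ("r19", "prop_b")): "something else",
}


def scan_preserving_schema(table, matching_ids, wanted_prop):
    """Emit matching data entries with their keys unchanged, in key order."""
    for key in sorted(table):
        _row, _cf, (doc_id, prop) = key
        if doc_id in matching_ids and prop == wanted_prop:
            yield key, table[key]


def client_view(results):
    """Client side: drop the cf the caller does not need."""
    return [(row, cq[0], cq[1], val) for (row, _cf, cq), val in results]


# Document IDs where prop_a < 9, assumed to come from the index lookup.
results = list(scan_preserving_schema(data, {"r15", "r8"}, "prop_b"))
emitted_keys = [key for key, _val in results]
print(client_view(results))
```

Because the keys are never rewritten, the emitted sequence stays in the
table's own sort order no matter which documents match.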

On Thu, Apr 2, 2015, 00:28 Russ Weeks <rweeks@newbrightidea.com> wrote:

> Thanks, Christopher. It's nice to hear an unambiguous point of view :)
>
> Do you see any alternative way of implementing a range scan on a
> partitioned index? The problem does not exist for exact-match scans because
> the row ID in the index entry CQ provides the correct ordering.
>
> Thanks,
> -Russ
>
> On Wed, Apr 1, 2015 at 9:11 PM, Christopher <ctubbsii@apache.org> wrote:
>
> > You should definitely not rely on this behavior. It goes against best
> > practices and is prone to error. It is not recommended.
> >
> > On Wed, Apr 1, 2015, 20:03 Russ Weeks <rweeks@newbrightidea.com> wrote:
> >
> > > A wonderful property of scan-time iterators is that they can emit row
> > > IDs in arbitrary order. Before I go off and build an index that relies on
> > > this behaviour, I'd like to get a sense of how likely it is to exist in
> > > future versions of Accumulo.
> > >
> > > I'd like to build an index like this (hopefully the ascii comes through,
> > > if not check here <https://gist.github.com/anonymous/1a64114da4b68a2ec822>):
> > >
> > >
> > >  row   | cf  | cq                | val
> > > -------------------------------------------------
> > >  p0    | i   | (prop_a, 7, r15)  | 1
> > >  p0    | i   | (prop_a, 8, r8)   | 1
> > >  p0    | i   | (prop_a, 9, r19)  | 1
> > > [...snip...]
> > >  p0    | d   | (r8, prop_a)      | 8
> > >  p0    | d   | (r8, prop_b)      | hello, world
> > >  p0    | d   | (r15, prop_a)     | 7
> > >  p0    | d   | (r15, prop_b)     | just testing
> > >  p0    | d   | (r19, prop_a)     | 9
> > >  p0    | d   | (r19, prop_b)     | something else
> > >
> > > Which is a pretty conventional partitioned index. I'd like to be able to
> > > issue a query like, "Tell me about prop_b for all documents where prop_a <
> > > 9" but I'm pretty sure that the only way this could work at scale is if
> > > it's OK for the iterator to return (p0, r15, prop_b, "just testing")
> > > followed by (p0, r8, prop_b, "hello, world").
> > >
> > > This works today - if you folks see any flaws in my reasoning please let
> > > me know - my question is, do you see this as functionality that should be
> > > preserved in the future?
> > >
> > > Thanks,
> > > -Russ
> > >
> >
>
