accumulo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Russ Weeks <rwe...@newbrightidea.com>
Subject Re: Scan-time iterators returning out-of-order rows
Date Wed, 08 Apr 2015 19:46:13 GMT
Thanks Christopher and David. It sounds like the right way to go is to
encode the data KV pairs in either the CQ or the Value of the index KV
pairs. I can probably make that work.

I really just wanted to get an opinion re. whether out-of-order rows was
behaviour that would probably be preserved across minor releases, and it
sounds like I shouldn't count on that.

Regards,
-Russ

On Tue, Apr 7, 2015 at 6:47 PM, Christopher <ctubbsii@apache.org> wrote:

> Oh, I see. You're seeking around within a tablet. Yeah, that's a tough one.
>
> Have you looked at the IntersectingIterator? I don't know if you could
> transition your schema to something like that, but I mention it
> because it employs a similar concept: term-based searching on a
> sharded, document-partitioned table ("tell me everything about
> document if it contains term X"). The main difference with that,
> though, is that it kind of expects the documents to be co-located with
> the index (terms).
>
> Other people on this list are probably more qualified to speak about
> the IntersectingIterator and similar strategies than I am, though, so
> I'll defer to them, if they chime in.
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
>
> On Thu, Apr 2, 2015 at 4:41 PM, Russ Weeks <rweeks@newbrightidea.com>
> wrote:
> > Hi, Christopher,
> >
> > This is how I see it working, let me know if I'm missing something. The
> > query is, "tell me about prop_b for all rows where prop_a < 9"
> >
> > 1. I set up a query from (-inf,+inf) on my partitioned index table with
> my
> > PartitionedRangeSearchIterator in scan scope. I configure the PRSI with
> > some encoded representation of my query.
> > 2. The PRSI seeks to (p0, i, (prop_a))
> > 3. For each key less than (p0, i, (prop_a, 9)),
> >   - Extract the row_id from the last field in the CQ tuple
> >   - Seek to (p0, d, (row_id, prop_b))
> >   - If prop_b exists for row_id, emit (p0, d, (row_id, prop_b)), <value
> of
> > prop_b>
> > [repeat for p1, p2 etc... all partitions available to iterator]
> >
> > It's the seek from the index entries to the data entries that could cause
> > the data entries to be emitted out of order. The index entries are
> ordered
> > by prop_a, the data entries are ordered by row id, the 2 orders will
> hardly
> > ever correspond.
> >
> > -Russ
> >
> > On Thu, Apr 2, 2015 at 1:18 PM, Christopher <ctubbsii@apache.org> wrote:
> >
> >> I'm not sure I understand how they'd be out of order in the iterator
> >> if they aren't out of order in the underlying source.
> >> How would your iterator return:
> >> ((p0, d, (r15, prop_b)), "just testing"),
> >> ((p0, d, (r8, prop_b)), "hello, world")
> >>
> >> when the underlying data is:
> >>  p0    | d   | (r8, prop_b)      | hello, world
> >>  p0    | d   | (r15, prop_b)     | just testing
> >>
> >> ?
> >>
> >> Why would it reorder the existing entries?
> >>
> >> --
> >> Christopher L Tubbs II
> >> http://gravatar.com/ctubbsii
> >>
> >>
> >> On Thu, Apr 2, 2015 at 1:52 PM, Russ Weeks <rweeks@newbrightidea.com>
> >> wrote:
> >> > Thanks for your response, Christopher.
> >> >
> >> > Yes, I see what you mean by promoting the CQ to the CF. I thought that
> >> > would simplify things but if not I could definitely return (k,v) pairs
> >> like,
> >> >
> >> > ((p0, d, (r15, prop_b)), "just testing"),
> >> > ((p0, d, (r8, prop_b)), "hello, world")
> >> >
> >> > (leaving the timestamps and visibilities out for clarity, and assuming
> >> that
> >> > r15 and r8 are encoded such that their lexical order matches their
> >> numeric
> >> > order)
> >> >
> >> > Leaving the existing schema intact is not a problem for me but it
> doesn't
> >> > get around the fact that the (k,v) pairs would be returned
> out-of-order
> >> by
> >> > the iterator. I guess another option would be to embed the "data" kv
> >> pairs
> >> > into the "index" kv pairs somehow, such as:
> >> >
> >> > ((p0, i, ((prop_a, 7, r15), prop_b)), "just testing")
> >> > ((p0, i, ((prop_a, 8, r8), prop_b)), "hello, world")
> >> >
> >> > I'm not too keen on that solution but if you're telling me that I
> >> shouldn't
> >> > rely on scan-time iterators being able to emit data out of order... I
> >> think
> >> > it's the least bad option.
> >> >
> >> > Regards,
> >> > -Russ
> >> >
> >> > On Thu, Apr 2, 2015 at 12:04 AM, Christopher <ctubbsii@apache.org>
> >> wrote:
> >> >
> >> >> So in your example, you actually did return then in order (lexically,
> >> not
> >> >> numerically), but I grok the idea that they might not be.
> >> >>
> >> >> The problem is that your transformation promotes a portion of the cq
> to
> >> the
> >> >> cf. That's fine if what your iterator is returning includes only that
> >> from
> >> >> a single cf (day, the 'data' cf). But otherwise, you could get
> >> duplicates
> >> >> or out of order results, which can mess up the client's expectations
> >> when
> >> >> retrieving batches from the servers. It could work in some limited
> >> cases,
> >> >> but I'd avoid it.
> >> >>
> >> >> Instead, why not preserve order by preserving the existing schema,
> and
> >> just
> >> >> ignore the unused cf in the client?
> >> >>
> >> >> On Thu, Apr 2, 2015, 00:28 Russ Weeks <rweeks@newbrightidea.com>
> wrote:
> >> >>
> >> >> > Thanks, Christopher. It's nice to hear an unambiguous point of
> view :)
> >> >> >
> >> >> > Do you see any alternative way of implementing a range scan on
a
> >> >> > partitioned index? The problem does not exist for exact-match
scans
> >> >> because
> >> >> > the row ID in the index entry CQ provides the correct ordering.
> >> >> >
> >> >> > Thanks,
> >> >> > -Russ
> >> >> >
> >> >> > On Wed, Apr 1, 2015 at 9:11 PM, Christopher <ctubbsii@apache.org>
> >> wrote:
> >> >> >
> >> >> > > You should definitely not rely on this behavior. It goes
against
> >> best
> >> >> > > practices and is prone to error. It is not recommended.
> >> >> > >
> >> >> > > On Wed, Apr 1, 2015, 20:03 Russ Weeks <rweeks@newbrightidea.com>
> >> >> wrote:
> >> >> > >
> >> >> > > > A wonderful property of scan-time iterators is that
they can
> emit
> >> row
> >> >> > IDs
> >> >> > > > in arbitrary order. Before I go off and build an index
that
> >> relies on
> >> >> > > this
> >> >> > > > behaviour, I'd like to get a sense of how likely it
is to
> exist in
> >> >> > future
> >> >> > > > versions of Accumulo.
> >> >> > > >
> >> >> > > > I'd like to build an index like this (hopefully the
ascii comes
> >> >> > through,
> >> >> > > if
> >> >> > > > not check here <
> >> >> https://gist.github.com/anonymous/1a64114da4b68a2ec822
> >> >> > > >):
> >> >> > > >
> >> >> > > >
> >> >> > > >  row   | cf  | cq                | val
> >> >> > > > -------------------------------------------------
> >> >> > > >  p0    | i   | (prop_a, 7, r15)  | 1
> >> >> > > >  p0    | i   | (prop_a, 8, r8)   | 1
> >> >> > > >  p0    | i   | (prop_a, 9, r19)  | 1
> >> >> > > > [...snip...]
> >> >> > > >  p0    | d   | (r8, prop_a)      | 8
> >> >> > > >  p0    | d   | (r8, prop_b)      | hello, world
> >> >> > > >  p0    | d   | (r15, prop_a)     | 7
> >> >> > > >  p0    | d   | (r15, prop_b)     | just testing
> >> >> > > >  p0    | d   | (r19, prop_a)     | 9
> >> >> > > >  p0    | d   | (r19, prop_b)     | something else
> >> >> > > >
> >> >> > > > Which is a pretty conventional partitioned index. I'd
like to
> be
> >> able
> >> >> > to
> >> >> > > > issue a query like, "Tell me about prop_b for all documents
> where
> >> >> > prop_a
> >> >> > > <
> >> >> > > > 9" but I'm pretty sure that the only way this could
work at
> scale
> >> is
> >> >> if
> >> >> > > > it's OK for the iterator to return (p0, r15, prop_b,
"just
> >> testing")
> >> >> > > > followed by (p0, r8, prop_b, "hello, world").
> >> >> > > >
> >> >> > > > This works today - if you folks see any flaws in my
reasoning
> >> please
> >> >> > let
> >> >> > > me
> >> >> > > > know - my question is, do you see this as functionality
that
> >> should
> >> >> be
> >> >> > > > preserved in the future?
> >> >> > > >
> >> >> > > > Thanks,
> >> >> > > > -Russ
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message