accumulo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Christopher <ctubb...@apache.org>
Subject Re: Scan-time iterators returning out-of-order rows
Date Thu, 02 Apr 2015 20:18:20 GMT
I'm not sure I understand how they'd be out of order in the iterator
if they aren't out of order in the underlying source.
How would your iterator return:
((p0, d, (r15, prop_b)), "just testing"),
((p0, d, (r8, prop_b)), "hello, world")

when the underlying data is:
 p0    | d   | (r8, prop_b)      | hello, world
 p0    | d   | (r15, prop_b)     | just testing

?

Why would it reorder the existing entries?

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Thu, Apr 2, 2015 at 1:52 PM, Russ Weeks <rweeks@newbrightidea.com> wrote:
> Thanks for your response, Christopher.
>
> Yes, I see what you mean by promoting the CQ to the CF. I thought that
> would simplify things but if not I could definitely return (k,v) pairs like,
>
> ((p0, d, (r15, prop_b)), "just testing"),
> ((p0, d, (r8, prop_b)), "hello, world")
>
> (leaving the timestamps and visibilities out for clarity, and assuming that
> r15 and r8 are encoded such that their lexical order matches their numeric
> order)
>
> Leaving the existing schema intact is not a problem for me but it doesn't
> get around the fact that the (k,v) pairs would be returned out-of-order by
> the iterator. I guess another option would be to embed the "data" kv pairs
> into the "index" kv pairs somehow, such as:
>
> ((p0, i, ((prop_a, 7, r15), prop_b)), "just testing")
> ((p0, i, ((prop_a, 8, r8), prop_b)), "hello, world")
>
> I'm not too keen on that solution but if you're telling me that I shouldn't
> rely on scan-time iterators being able to emit data out of order... I think
> it's the least bad option.
>
> Regards,
> -Russ
>
> On Thu, Apr 2, 2015 at 12:04 AM, Christopher <ctubbsii@apache.org> wrote:
>
>> So in your example, you actually did return then in order (lexically, not
>> numerically), but I grok the idea that they might not be.
>>
>> The problem is that your transformation promotes a portion of the cq to the
>> cf. That's fine if what your iterator is returning includes only that from
>> a single cf (day, the 'data' cf). But otherwise, you could get duplicates
>> or out of order results, which can mess up the client's expectations when
>> retrieving batches from the servers. It could work in some limited cases,
>> but I'd avoid it.
>>
>> Instead, why not preserve order by preserving the existing schema, and just
>> ignore the unused cf in the client?
>>
>> On Thu, Apr 2, 2015, 00:28 Russ Weeks <rweeks@newbrightidea.com> wrote:
>>
>> > Thanks, Christopher. It's nice to hear an unambiguous point of view :)
>> >
>> > Do you see any alternative way of implementing a range scan on a
>> > partitioned index? The problem does not exist for exact-match scans
>> because
>> > the row ID in the index entry CQ provides the correct ordering.
>> >
>> > Thanks,
>> > -Russ
>> >
>> > On Wed, Apr 1, 2015 at 9:11 PM, Christopher <ctubbsii@apache.org> wrote:
>> >
>> > > You should definitely not rely on this behavior. It goes against best
>> > > practices and is prone to error. It is not recommended.
>> > >
>> > > On Wed, Apr 1, 2015, 20:03 Russ Weeks <rweeks@newbrightidea.com>
>> wrote:
>> > >
>> > > > A wonderful property of scan-time iterators is that they can emit
row
>> > IDs
>> > > > in arbitrary order. Before I go off and build an index that relies
on
>> > > this
>> > > > behaviour, I'd like to get a sense of how likely it is to exist in
>> > future
>> > > > versions of Accumulo.
>> > > >
>> > > > I'd like to build an index like this (hopefully the ascii comes
>> > through,
>> > > if
>> > > > not check here <
>> https://gist.github.com/anonymous/1a64114da4b68a2ec822
>> > > >):
>> > > >
>> > > >
>> > > >  row   | cf  | cq                | val
>> > > > -------------------------------------------------
>> > > >  p0    | i   | (prop_a, 7, r15)  | 1
>> > > >  p0    | i   | (prop_a, 8, r8)   | 1
>> > > >  p0    | i   | (prop_a, 9, r19)  | 1
>> > > > [...snip...]
>> > > >  p0    | d   | (r8, prop_a)      | 8
>> > > >  p0    | d   | (r8, prop_b)      | hello, world
>> > > >  p0    | d   | (r15, prop_a)     | 7
>> > > >  p0    | d   | (r15, prop_b)     | just testing
>> > > >  p0    | d   | (r19, prop_a)     | 9
>> > > >  p0    | d   | (r19, prop_b)     | something else
>> > > >
>> > > > Which is a pretty conventional partitioned index. I'd like to be able
>> > to
>> > > > issue a query like, "Tell me about prop_b for all documents where
>> > prop_a
>> > > <
>> > > > 9" but I'm pretty sure that the only way this could work at scale
is
>> if
>> > > > it's OK for the iterator to return (p0, r15, prop_b, "just testing")
>> > > > followed by (p0, r8, prop_b, "hello, world").
>> > > >
>> > > > This works today - if you folks see any flaws in my reasoning please
>> > let
>> > > me
>> > > > know - my question is, do you see this as functionality that should
>> be
>> > > > preserved in the future?
>> > > >
>> > > > Thanks,
>> > > > -Russ
>> > > >
>> > >
>> >
>>

Mime
View raw message