accumulo-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Fuchs <afu...@apache.org>
Subject Re: multi-table isolated batch scanner
Date Tue, 16 Apr 2013 16:33:13 GMT
Keith,

I'm essentially performing a local sort-merge inner join (local meaning
within one tablet at a time). This is similar to what the intersecting
iterator does, but I'm doing it on data that is not already sorted.
Applications of this include any kind of indexed boolean logic predicate
search in which one or more of the predicates cannot be tied to an exact
term, but can be tied to a range of terms. Wildcard searches, multi-hop
graph searches, and range queries all fit this pattern.

After the join there is additional processing done on the results. I would
like to keep some of the intermediate data buffered, like the sorted set of
IDs that match a predicate, even though these are not directly part of the
output data. The multiple passes are needed for when the intermediate data
is too big to hold in memory (while maintaining concurrency, etc.), but
they don't really complicate the underlying problem.

Adam



On Tue, Apr 16, 2013 at 9:29 AM, Keith Turner <keith@deenlo.com> wrote:

> On Mon, Apr 15, 2013 at 6:19 PM, Adam Fuchs <afuchs@apache.org> wrote:
> > Keith,
> >
> > In this case we're filling the buffer before we can amortize the search
> > cost. We're using a document-partitioned table design and we have to do a
> > local sort before we can get the first result.
> >
>
> I am not sure exactly what you are doing.  To me it sounds like you
> may be doing the following, but not sure.
>
>  * Seeking N iterators
>  * Doing what the intersecting iterator does, joining docids or
> seeking iterators
>  * Collecting a set of key values (docids or documents?) and sorting
> them?  How much is collected before sort?  Why sort?  Is filtering
> done after the sort?
>
> Or did you meant something else by 'local sort' ? like a sort on the
> client side?  But this does not seem the entire story as you mentioned
> something about multiple passes.  Are you hash partitioning the data?
> Are you building something in memoery besides the buffered output
> data?
>
>
> > I have found that increasing the buffer size also increases the latency
> for
> > getting the first results. This application is both latency and
> throughput
> > sensitive. In addition, increasing the batch size too much puts
> significant
> > memory requirements on the process running the batch scanner.
> >
> > Adam
> >
> >
> >
> > On Mon, Apr 15, 2013 at 5:33 PM, Keith Turner <keith@deenlo.com> wrote:
> >
> >> On Mon, Apr 15, 2013 at 5:06 PM, Adam Fuchs <afuchs@apache.org> wrote:
> >> > Chris,
> >> >
> >> > The desire for isolation stems from the desire to amortize some
> >> computation
> >> > over a number of results. Say it takes 5 seconds to compute an
> >> intersection
> >>
> >> Would increasing the size of the key/value buffer help in your case?
> >> The iterator stack is not torn down until that buffer fills up or the
> >> end of tablet is reached.  Are you concerned about the cost of
> >> reconstructing the iterator stack across tablets?
> >>
> >> > of a couple of sets within the iterators, and then streaming back the
> >> > results takes a minute or so. If I have to redo the 5 second
> computation
> >> > many times, as in to support the reconstruction of the iterator tree,
> >> then
> >> > that computation may start to dominate my query performance.
> Primarily,
> >> > this means I need to be able to continue a scan without having to
> rebuild
> >> > the iterators. Isolation in the scanner has that side effect. Proper
> >> > isolation would be a "nice-to-have", but I can deal with not having
> it.
> >> >
> >> > Adam
> >> >
> >> >
> >> >
> >> > On Mon, Apr 15, 2013 at 4:13 PM, Christopher <ctubbsii@apache.org>
> >> wrote:
> >> >
> >> >> Adam-
> >> >>
> >> >> It seems like you're talking about two features at once:
> >> >> 1) Multi-table batch scanner.
> >> >> 2) Scan Isolation on batch scanners like we have on regular scanners.
> >> >> Is that correct?
> >> >>
> >> >> I can see the utility of a multi-table batch scanner, but I haven't
> >> >> seen a compelling need for implementing isolation on the
> >> >> batch-scanners. Do you have a use case in mind for that?
> >> >>
> >> >> Also, it seems that your use case for isolation is not so much the
> >> >> isolated reads, but the statefulness of the iterator stack on the
> >> >> server side. Is this correct? If so, I'm even more curious about your
> >> >> use case for this, since that statefulness is only guaranteed
> per-row.
> >> >>
> >> >>
> >> >> --
> >> >> Christopher L Tubbs II
> >> >> http://gravatar.com/ctubbsii
> >> >>
> >> >>
> >> >> On Mon, Apr 15, 2013 at 3:10 PM, Adam Fuchs <afuchs@apache.org>
> wrote:
> >> >> > Thanks Bill,
> >> >> >
> >> >> > I care about latency and throughput. First available result
> ordering
> >> is
> >> >> > fine, though.
> >> >> >
> >> >> > Does Guava just chain through a collection of iterators, completing
> >> one
> >> >> > then moving to the next?
> >> >> >
> >> >> > Adam
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Mon, Apr 15, 2013 at 3:06 PM, William Slacum <
> >> >> > wilhelm.von.cloud@accumulo.net> wrote:
> >> >> >
> >> >> >> How are you expecting to get results back? Guava's Iterables
could
> >> >> concat a
> >> >> >> bunch of a Scanners together, if you didn't care about the
> throughput
> >> >> >> aspect of it and simply wanted results from multiple tables.
> >> >> >>
> >> >> >> On Mon, Apr 15, 2013 at 3:00 PM, Adam Fuchs <afuchs@apache.org>
> >> wrote:
> >> >> >>
> >> >> >> > Is anyone else pining for a multi-table isolated batch
scanner,
> or
> >> is
> >> >> it
> >> >> >> > just me? I like the automatic parallelism and balancing
of the
> >> batch
> >> >> >> > scanner, but I'm looking to maintain server-side state
in my
> >> iterators
> >> >> >> over
> >> >> >> > long-running scans. I would also like to scan over multiple
> tables
> >> >> >> > concurrently. Has anyone tried hacking something together
with a
> >> pool
> >> >> of
> >> >> >> > non-batch scanners?
> >> >> >> >
> >> >> >> > Adam
> >> >> >> >
> >> >> >>
> >> >>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message