accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: is there any "trick" to save the state of an iterator?
Date Tue, 10 Jan 2017 02:37:32 GMT
And yet, Accumulo still doesn't have the API to safely do it.

See ACCUMULO-1280 if you'd like to contribute toward those efforts for
the community.

On Jan 9, 2017 20:23, "Jeremy Kepner" <kepner@ll.mit.edu> wrote:

> It's done in D4M (d4m.mit.edu), you might look there.
> Dylan can explain (if necessary).
> Regards.  -Jeremy
>
> On Mon, Jan 09, 2017 at 07:30:03PM -0500, Josh Elser wrote:
> > Great. Glad I wasn't derailing things :)
> >
> > Unfortunately, I don't think this is a very well-documented area of the
> > code (it's quite advanced and would just confuse most users).
> >
> > I'll have to think about it some more and see if I can come up with
> > anything clever. I know there are some others subscribed to this list
> > who might be more clever than I am -- I'm sure they'll weigh in if they
> > have any suggestions.
> >
> > Finally, if you're interested in helping us put together some sort of
> > "advanced indexing" docs for the project, I'm sure we could find a few
> > people who would be happy to get something published on the Accumulo
> > website.
> >
> > Massimilian Mattetti wrote:
> > > Thank you for your answer John, you understood perfectly what my use
> > > case is.
> > >
> > > The possible solutions that you propose had occurred to me, too. This
> > > confirms that, unfortunately, there is no fancy way to overcome
> > > this problem.
> > >
> > > Is there any good documentation on different query planning for
> > > Accumulo that could help with my use case?
> > > Thanks.
> > >
> > > Regards,
> > > Max
> > >
> > >
> > >
> > >
> > > From: Josh Elser <josh.elser@gmail.com>
> > > To: user@accumulo.apache.org
> > > Date: 09/01/2017 21:55
> > > Subject: Re: is there any "trick" to save the state of an iterator?
> > > ------------------------------------------------------------------------
> > >
> > >
> > >
> > > Hey Max,
> > >
> > > There is no provided mechanism to do this, and this is a problem with
> > > supporting "range queries". I'm hoping I'm understanding your use-case
> > > correctly; sorry in advance if I'm going off on a tangent.
> > >
> > > When performing the standard sort-merge join across some columns to
> > > implement intersections and unions, the un-sorted range of values you
> > > want to scan over (500k-600k) breaks the ordering of the docIds which
> > > you are trying to catch.
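
To illustrate the ordering problem described above, here is a minimal sketch in plain Python. It does not use any Accumulo API. The index entries, docIds, and the function name `intersect_sorted` are hypothetical, and the index is modeled simply as a list sorted by (value, docId).

```python
def intersect_sorted(a, b):
    """Sort-merge intersection; only correct if both inputs are ascending."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical index entries for property A, sorted by (value, docId),
# the way a sorted key-value store would hold them.
index_a = [(500000, "d3"), (500000, "d7"), (500001, "d1"), (500002, "d5")]

# docIds for one exact value come back in ascending order.
single_value = [doc for value, doc in index_a if value == 500000]

# docIds for a range of values are ordered within each value, not overall.
range_scan = [doc for value, doc in index_a if 500000 <= value <= 500002]

# Hypothetical docIds matching a second condition, already sorted.
other = ["d1", "d3"]

# Feeding the unsorted range scan into the merge join drops a match ("d1").
broken = intersect_sorted(range_scan, other)

# Sorting first gives the right answer, but needs the whole list buffered.
correct = intersect_sorted(sorted(range_scan), other)
```

In this toy data the merge join over the raw range scan misses "d1", which is the kind of breakage the paragraph above refers to.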
> > >
> > > The trivial solution is to convert a range into a union of discrete
> > > values (500000 || 500001 || 500002 || ..) but you can see how this
> > > quickly falls apart. An inverted index could be used to enumerate the
> > > values that exist in the range.
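
A rough sketch of the range-to-union idea above, in plain Python rather than Accumulo code. The `postings` dictionary stands in for an inverted index and is an assumption, as are the function names. Each per-value docId list is sorted, so a k-way merge yields one sorted stream.

```python
import heapq
from bisect import bisect_left, bisect_right

# Hypothetical inverted index: property value -> ascending list of docIds.
postings = {
    500000: ["d3", "d7"],
    500001: ["d1"],
    500002: ["d5"],
    700000: ["d2"],
}
sorted_values = sorted(postings)


def values_in_range(lo, hi):
    """Enumerate only the values that actually exist in [lo, hi]."""
    start = bisect_left(sorted_values, lo)
    end = bisect_right(sorted_values, hi)
    return sorted_values[start:end]


def union_range(lo, hi):
    """Union the per-value docId lists into one ascending, de-duplicated list."""
    streams = [postings[v] for v in values_in_range(lo, hi)]
    merged = []
    for doc in heapq.merge(*streams):
        if not merged or merged[-1] != doc:
            merged.append(doc)
    return merged


docs = union_range(500000, 600000)
```

The merge keeps one open stream per distinct value in the range, which is why this approach degrades as the range covers more values.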
> > >
> > > Another trivial solution would be to select all records matching the
> > > smaller condition, and then post-filter the other condition.
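
The select-then-post-filter approach above could look roughly like this. It is plain Python with made-up records, not an Accumulo iterator. It assumes the condition on property A matches fewer records and that property B can be read from each fetched record.

```python
# Hypothetical data section: docId -> record with its property values.
records = {
    "d1": {"A": 500001, "B": 0.25},
    "d3": {"A": 500000, "B": 0.90},
    "d5": {"A": 500002, "B": 0.10},
    "d9": {"A": 100, "B": 0.30},
}


def select_then_filter(candidates, predicate):
    """Fetch each candidate record and keep those that pass the second condition."""
    return [doc for doc in candidates if predicate(records[doc])]


# Step 1: docIds matching the smaller condition (here, the range on A).
candidates = sorted(
    doc for doc, rec in records.items() if 500000 <= rec["A"] <= 600000
)

# Step 2: post-filter on the other condition (the range on B).
result = select_then_filter(candidates, lambda rec: 0.1 <= rec["B"] <= 0.4)
```

No docId ordering across the second condition is needed here, at the cost of reading every candidate record.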
> > >
> > > There might be some trickier query planning decisions you could
> > > also experiment with (I'd have to give it lots more thought). In short,
> > > I'd recommend against trying to solve the problem via saving state.
> > > Architecturally, this is just not something that Accumulo Iterators are
> > > designed to support at this time.
> > >
> > > - Josh
> > >
> > > Massimilian Mattetti wrote:
> > >  > Hi all,
> > >  >
> > >  > I am working with a Document-Partitioned Index table whose index
> > >  > sections are accessed using ranges over the indexed properties (e.g.
> > >  > property A ∈ [500,000 - 600,000], property B ∈ [0.1 - 0.4], etc.). The
> > >  > iterator that handles this table works in two steps: first, it
> > >  > calculates all the results from the index section of a single bin
> > >  > (doing intersections and unions on different properties); second,
> > >  > using the ids retrieved from the index, it goes over the data section
> > >  > of that bin.
> > >  > This iterator has proved to have a significant performance penalty
> > >  > whenever the amount of data retrieved from the index is orders of
> > >  > magnitude bigger than the table_scan_max_memory, i.e. the iterator is
> > >  > torn down tens of times for each bin. Since there is no explicit way to
> > >  > save the state of an iterator, is there any other mechanism/approach
> > >  > that I could use/follow in order to avoid re-calculating the index
> > >  > result set after each teardown?
> > >  > Thanks.
> > >  >
> > >  >
> > >  > Regards,
> > >  > Max
> > >  >
> > > .
> > >
> > >
> > >
> > >
>
