accumulo-user mailing list archives

From "Massimilian Mattetti" <MASSI...@il.ibm.com>
Subject Re: is there any "trick" to save the state of an iterator?
Date Sun, 15 Jan 2017 07:34:43 GMT
I am trying a simple two-step strategy. First, an iterator looks for all 
the unique values of a property that fall inside the queried range. If the 
number of unique values exceeds a pre-defined threshold, I give up on the 
index and do a full scan over the data; otherwise, another set of 
iterators computes the intersection and the union on the index using the 
values retrieved by the first iterator. This way, if I have a query like 
A ∈ [500,000 - 600,000] and the seek range is fairly small, I may end up 
with just a few unique values for property A. The wikisearch example 
inspired this approach. In particular, I am looking at the 
UniqFieldNameValueIterator for implementing the first iterator, although 
I am not sure it works correctly. Has anybody ever played with it?
Thanks.


Regards,
Max
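
A minimal sketch of the two-step decision described above, written in plain 
Python over an in-memory sorted list rather than against the Accumulo 
iterator API. The names unique_values_in_range and plan_query, the sample 
keys, and the threshold values are all hypothetical; the sketch only 
illustrates the "count unique values, then pick index or full scan" logic.

```python
from bisect import bisect_left, bisect_right


def unique_values_in_range(sorted_index_keys, low, high, threshold):
    """Collect the distinct values in [low, high] from a sorted key list.

    Returns None as soon as more than `threshold` distinct values are seen,
    so the caller can give up on the index early.
    """
    start = bisect_left(sorted_index_keys, low)
    end = bisect_right(sorted_index_keys, high)
    unique = []
    for value in sorted_index_keys[start:end]:
        # Keys are sorted, so duplicates are always adjacent.
        if not unique or unique[-1] != value:
            unique.append(value)
            if len(unique) > threshold:
                return None
    return unique


def plan_query(sorted_index_keys, low, high, threshold):
    """Step 1 decides the plan; step 2 (not shown) would execute it."""
    values = unique_values_in_range(sorted_index_keys, low, high, threshold)
    if values is None:
        return ("full_scan", [])
    return ("index", values)


# Hypothetical index keys for property A (assumed data, not from the thread).
keys = [100, 500000, 500000, 500007, 599999, 700000]
few_values_plan = plan_query(keys, 500000, 600000, threshold=5)
too_many_plan = plan_query(keys, 500000, 600000, threshold=2)
```

Stopping the count as soon as the threshold is crossed keeps the first step 
cheap in the case where the full scan wins anyway.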




From:   Christopher <ctubbsii@apache.org>
To:     user@accumulo.apache.org, "Kepner, Jeremy - 0553 - MITLL" <kepner@ll.mit.edu>
Date:   10/01/2017 04:46
Subject:        Re: is there any "trick" to save the state of an iterator?



FWIW, there is an open pull request on that issue that puts the work very 
near to completion. It could probably use a bit more testing and review, 
though.

On Mon, Jan 9, 2017 at 9:37 PM Josh Elser <josh.elser@gmail.com> wrote:
And yet, Accumulo still doesn't have the API to safely do it.

See ACCUMULO-1280 if you'd like to contribute to those efforts for the 
community.

On Jan 9, 2017 20:23, "Jeremy Kepner" <kepner@ll.mit.edu> wrote:
It's done in D4M (d4m.mit.edu), you might look there.
Dylan can explain (if necessary).
Regards.  -Jeremy

On Mon, Jan 09, 2017 at 07:30:03PM -0500, Josh Elser wrote:
> Great. Glad I wasn't derailing things :)
>
> Unfortunately, I don't think this is a very well-documented area of the
> code (it's quite advanced and would just confuse most users).
>
> I'll have to think about it some more and see if I can come up with
> anything clever. I know there are some others subscribed to this list
> who might be more clever than I am -- I'm sure they'll weigh in if they
> have any suggestions.
>
> Finally, if you're interested in helping us put together some sort of
> "advanced indexing" docs for the project, I'm sure we could find a few
> people who would be happy to get something published on the Accumulo
> website.
>
> Massimilian Mattetti wrote:
> > Thank you for your answer John, you understood perfectly what my use
> > case is.
> >
> > The possible solutions that you propose had occurred to me, too. This
> > confirms to me that, unfortunately, there is no fancy way to overcome
> > this problem.
> >
> > Is there any good documentation on different query planning for
> > Accumulo that could help with my use case?
> > Thanks.
> >
> > Regards,
> > Max
> >
> >
> >
> >
> > From: Josh Elser <josh.elser@gmail.com>
> > To: user@accumulo.apache.org
> > Date: 09/01/2017 21:55
> > Subject: Re: is there any "trick" to save the state of an iterator?
> > ------------------------------------------------------------------------
> >
> >
> >
> > Hey Max,
> >
> > There is no provided mechanism to do this, and this is a problem with
> > supporting "range queries". I'm hoping I'm understanding your use-case
> > correctly; sorry in advance if I'm going off on a tangent.
> >
> > When performing the standard sort-merge join across some columns to
> > implement intersections and unions, the un-sorted range of values you
> > want to scan over (500k-600k) breaks the ordering of the docIds which
> > you are trying to catch.
> >
> > The trivial solution is to convert a range into a union of discrete
> > values (500000 || 500001 || 500002 || ..) but you can see how this
> > quickly falls apart. An inverted index could be used to enumerate the
> > values that exist in the range.
> >
> > Another trivial solution would be to select all records matching the
> > smaller condition, and then post-filter the other condition.
> >
> > There might be some trickier query planning decisions you could
> > also experiment with (I'd have to give it lots more thought). In
> > short, I'd recommend against trying to solve the problem via saving
> > state. Architecturally, this is just not something that Accumulo
> > Iterators are designed to support at this time.
> >
> > - Josh
> >
> > Massimilian Mattetti wrote:
> >  > Hi all,
> >  >
> >  > I am working with a Document-Partitioned Index table whose index
> >  > sections are accessed using ranges over the indexed properties
> >  > (e.g. property A ∈ [500,000 - 600,000], property B ∈ [0.1 - 0.4],
> >  > etc.). The iterator that handles this table works in two steps:
> >  > first, it calculates all the results from the index section of a
> >  > single bin (doing intersection and union on different properties);
> >  > second, using the ids retrieved from the index, it goes over the
> >  > data section of that bin.
> >  > This iterator has proved to have a significant performance penalty
> >  > whenever the amount of data retrieved from the index is orders of
> >  > magnitude bigger than table_scan_max_memory, i.e. the iterator is
> >  > torn down tens of times for each bin. Since there is no explicit
> >  > way to save the state of an iterator, is there any other
> >  > mechanism/approach that I could use in order to avoid
> >  > re-calculating the index result set after each teardown?
> >  > Thanks.
> >  >
> >  >
> >  > Regards,
> >  > Max
> >  >
> > .
> >
> >
> >
> >
-- 
Christopher
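
For readers following the quoted discussion of sort-merge joins: the idea of 
turning a range into a union of discrete values, then intersecting the result 
with another condition, can be sketched as below. This is plain Python over 
sorted in-memory lists, not Accumulo code; union_sorted, intersect_sorted, 
the posting lists, and the doc ids are hypothetical and chosen only for 
illustration.

```python
import heapq


def union_sorted(posting_lists):
    """Merge several sorted doc-id lists into one sorted, de-duplicated list."""
    result = []
    for doc_id in heapq.merge(*posting_lists):
        if not result or result[-1] != doc_id:
            result.append(doc_id)
    return result


def intersect_sorted(left, right):
    """Classic sort-merge intersection of two sorted doc-id lists."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


# Assumed posting lists: each discrete value of property A found in the
# range maps to a list of doc ids that is sorted on its own.
postings_a = {
    500000: ["doc1", "doc4"],
    500007: ["doc2", "doc4"],
    599999: ["doc3"],
}
# Assumed doc ids already matching the condition on property B.
matches_b = ["doc2", "doc4", "doc9"]

docs_a = union_sorted(postings_a.values())
docs_a_and_b = intersect_sorted(docs_a, matches_b)
```

Each per-value list is sorted by doc id, but walking the range value by value 
does not yield doc ids in order, which is why the union has to merge the 
lists before the intersection can proceed in a single pass.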



