accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: is there any "trick" to save the state of an iterator?
Date Tue, 10 Jan 2017 00:30:03 GMT
Great. Glad I wasn't derailing things :)

Unfortunately, I don't think this is a very well-documented area of the
code (it's quite advanced and would just confuse most users).

I'll have to think about it some more and see if I can come up with
anything clever. I know there are some others subscribed to this list
who might be more clever than I am -- I'm sure they'll weigh in if they
have any suggestions.

Finally, if you're interested in helping us put together some sort of
"advanced indexing" docs for the project, I'm sure we could find a few
people who would be happy to get something published on the Accumulo
website.

Massimilian Mattetti wrote:
> Thank you for your answer, Josh. You understood my use case
> perfectly.
> 
> The possible solutions that you propose had occurred to me, too. This
> confirms that, unfortunately, there is no fancy way to overcome
> this problem.
> 
> Is there any good documentation on query planning strategies for
> Accumulo that could help with my use case?
> Thanks.
> 
> Regards,
> Max
> 
> 
> 
> 
> From: Josh Elser <josh.elser@gmail.com>
> To: user@accumulo.apache.org
> Date: 09/01/2017 21:55
> Subject: Re: is there any "trick" to save the state of an iterator?
> ------------------------------------------------------------------------
> 
> 
> 
> Hey Max,
> 
> There is no provided mechanism to do this, and this is a problem with
> supporting "range queries". I'm hoping I'm understanding your use-case
> correctly; sorry in advance if I'm going off on a tangent.
> 
> When performing the standard sort-merge join across some columns to
> implement intersections and unions, scanning over a range of values
> (500k-600k) returns docIds ordered by value rather than by docId, which
> breaks the docId ordering that the join relies on.
> 
> The trivial solution is to convert a range into a union of discrete
> values (500000 || 500001 || 500002 || ..) but you can see how this
> quickly falls apart. An inverted index could be used to enumerate the
> values that exist in the range.
> 
> Another trivial solution would be to select all records matching the
> smaller condition, and then post-filter the other condition.
> 
> There might be some trickier query planning decisions you could
> also experiment with (I'd have to give it lots more thought). In short,
> I'd recommend against trying to solve the problem via saving state.
> Architecturally, this is just not something that Accumulo Iterators are
> designed to support at this time.
> 
> - Josh
> 
> Massimilian Mattetti wrote:
>  > Hi all,
>  >
>  > I am working with a Document-Partitioned Index table whose index
>  > sections are accessed using ranges over the indexed properties (e.g.
>  > property A ∈ [500,000 - 600,000], property B ∈ [0.1 - 0.4], etc.). The
>  > iterator that handles this table works in two steps: first, it
>  > calculates the full result set from the index section of a single bin
>  > (by intersecting and unioning the different properties); second, it
>  > uses the ids retrieved from the index to go over the data section of
>  > that bin.
>  > This iterator has proved to incur a significant performance penalty
>  > whenever the amount of data retrieved from the index is orders of
>  > magnitude bigger than the table_scan_max_memory, i.e. the iterator is
>  > torn down tens of times for each bin. Since there is no explicit way to
>  > save the state of an iterator, is there any other mechanism or approach
>  > that I could use to avoid recalculating the index result set after
>  > each teardown?
>  > Thanks.
>  >
>  >
>  > Regards,
>  > Max
>  >
> .
> 
> 
> 
> 
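
The options discussed in the quoted reply can be illustrated with a small Python sketch. This is not Accumulo code and does not use any Accumulo API. A plain dict stands in for one bin's index section, and DOCS, build_index, range_scan, merged_range, intersect_sorted, and post_filter are hypothetical names invented for this example. It shows why a value-range scan breaks docId order, how enumerating the values present in the range and merging their docId lists restores it, and how post-filtering avoids the join altogether.

```python
import heapq

# Hypothetical data section of one bin: docId -> properties.
DOCS = {
    1: {"A": 500_001, "B": 0.2},
    2: {"A": 650_000, "B": 0.2},
    3: {"A": 500_000, "B": 0.2},
    4: {"A": 100, "B": 0.2},
    7: {"A": 500_001, "B": 0.3},
    8: {"A": 700_000, "B": 0.3},
    9: {"A": 500_000, "B": 0.9},
}


def build_index(docs):
    """Build property -> value -> sorted docId list (stand-in for the index section)."""
    index = {}
    for doc_id in sorted(docs):
        for prop, value in docs[doc_id].items():
            index.setdefault(prop, {}).setdefault(value, []).append(doc_id)
    return index


def range_scan(index, prop, lo, hi):
    """Yield docIds in index key order: by value first, then by docId."""
    for value in sorted(index[prop]):
        if lo <= value <= hi:
            yield from index[prop][value]


def merged_range(index, prop, lo, hi):
    """Union of the discrete values present in [lo, hi], merged into docId order."""
    lists = [ids for value, ids in index[prop].items() if lo <= value <= hi]
    return heapq.merge(*lists)


def intersect_sorted(left, right):
    """Sort-merge intersection; both inputs must be sorted by docId."""
    left, right = iter(left), iter(right)
    a, b = next(left, None), next(right, None)
    while a is not None and b is not None:
        if a == b:
            yield a
            a, b = next(left, None), next(right, None)
        elif a < b:
            a = next(left, None)
        else:
            b = next(right, None)


def post_filter(index, docs, scan_prop, scan_range, filter_prop, filter_range):
    """Scan one condition via the index, then check the other on each document."""
    lo, hi = filter_range
    return [
        doc_id
        for doc_id in range_scan(index, scan_prop, *scan_range)
        if lo <= docs[doc_id][filter_prop] <= hi
    ]


INDEX = build_index(DOCS)

# A plain range scan is ordered by value, so docIds come out unsorted.
unsorted_ids = list(range_scan(INDEX, "A", 500_000, 600_000))

# Merging the per-value lists restores docId order, so the join works.
joined = list(
    intersect_sorted(
        merged_range(INDEX, "A", 500_000, 600_000),
        merged_range(INDEX, "B", 0.1, 0.4),
    )
)

# Post-filtering needs no ordering at all, at the cost of reading each document.
filtered = post_filter(INDEX, DOCS, "A", (500_000, 600_000), "B", (0.1, 0.4))
```

In this toy data the merge holds one cursor per distinct value in the range, which is why the union-of-discrete-values approach degrades as the range widens. Post-filtering keeps no such state, so in this sketch it has nothing to rebuild if it is restarted partway through.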
