accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Seeking Iterator
Date Mon, 12 Jan 2015 21:10:40 GMT
Eugene Cheipesh wrote:
> That is the idea I am playing with, optimizing subsequent calls to .next
> vs .reseek.
>
> Using the VersioningIterator as an example, I was able to get reseek
> working; the trick turned out to be checking getSource().hasTop() a
> little more carefully. Thank you for that pointer. Disappointingly,
> there does not appear to be a huge difference in performance from a
> Filter that performs the same checking without seeking forward.

seek()'ing doesn't always imply an increase in performance -- remember 
that RFiles (the files that back Accumulo tables) are composed of 
multiple blocks/sections with an index over them. A seek consists of 
using that index to find the right block/section of the RFile and then 
a linear scan forward to find the first key in the range you seek()'ed to.

Thus, if you're repeatedly re-seek()'ing within the same block, you'll 
waste a lot of time re-reading the same data. In your situation, it 
sounds like the cost of re-reading the data after a seek is about the 
same as naively consuming the records.
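
For reference, the "call next() a few times before falling back to 
seek()" approach that Christopher describes below looks roughly like 
this sketch. The helper name and the MAX_NEXT_CALLS constant are mine, 
not from VersioningIterator -- tune the constant experimentally:

import java.io.IOException;
import java.util.Collections;

import org.apache.accumulo.core.data.ByteSequence;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;

public class SeekOrNextHelper {
  // Tuning knob: how many next() calls to try before paying for a seek.
  private static final int MAX_NEXT_CALLS = 10;

  // Advance the source to the first key >= target, preferring cheap
  // sequential next() calls (which stay inside the current decompressed
  // block when the target is nearby) and seeking only when it is not.
  static void advanceTo(SortedKeyValueIterator<Key,Value> source,
      Key target, Range seekRange) throws IOException {
    for (int i = 0; i < MAX_NEXT_CALLS; i++) {
      if (!source.hasTop() || source.getTopKey().compareTo(target) >= 0) {
        return; // reached or passed the target, or exhausted the data
      }
      source.next();
    }
    if (source.hasTop() && source.getTopKey().compareTo(target) < 0) {
      // Still short of the target: seek, which re-consults the RFile
      // index. In a real iterator, reuse the columnFamilies/inclusive
      // arguments from the enclosing seek() instead of these defaults.
      Range skip = new Range(target, true, seekRange.getEndKey(),
          seekRange.isEndKeyInclusive());
      source.seek(skip, Collections.<ByteSequence>emptyList(), false);
    }
  }
}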

You can try altering table.file.compress.blocksize (and then compacting 
your table so existing files are rewritten) to see how this changes the 
trade-off.
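
Something like the following would do it through the Java client API 
(the table name and the 64K value are placeholders; pick a size to 
experiment with):

import org.apache.accumulo.core.client.Connector;

public class BlockSizeTweak {
  // Shrink the compression block size, then force a full major
  // compaction so existing RFiles are rewritten with the new size.
  static void shrinkBlockSize(Connector conn) throws Exception {
    conn.tableOperations().setProperty("mytable",
        "table.file.compress.blocksize", "64K");
    // null start/end rows compact the whole table; flush first and
    // wait for the compaction to finish.
    conn.tableOperations().compact("mytable", null, null, true, true);
  }
}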

> I am attempting to use Accumulo tracing to come up with an explanation
> but results are spotty. Query runtime is ~2.3s for 200k entries and I am
> only capturing a tiny fraction of that through tracing.

You mean that your iterator isn't accounting for a majority of the 
execution time? Are you bounded by the time it takes to read the data 
from disk?

> I am using
> AccumuloInputFormat to pull results into a spark job. Would I get better
> results if I was creating the BatchScanner directly? Otherwise what
> would be the best method to debug the iterator and query performance?

Are there 200k entries in the table or are you extracting 200k out of N 
entries? Hooking into tracing is definitely the right approach to take 
to get perf information.
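
If you do want to take Spark and AccumuloInputFormat out of the 
equation, a direct BatchScanner run is a quick way to time just the 
scan. A rough sketch -- the table name, thread count, priority, and the 
ZCurveIterator class standing in for your iterator are all placeholders:

import java.util.Collections;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class ScanTimer {
  static void timeScan(Connector conn) throws Exception {
    BatchScanner scanner =
        conn.createBatchScanner("mytable", Authorizations.EMPTY, 4);
    try {
      scanner.setRanges(Collections.singleton(new Range())); // full table
      scanner.addScanIterator(
          new IteratorSetting(50, "zcurve", ZCurveIterator.class));
      long count = 0;
      long start = System.nanoTime();
      for (Entry<Key,Value> e : scanner) {
        count++; // drain the results; the filtering happens server-side
      }
      System.out.printf("%d entries in %.2f s%n", count,
          (System.nanoTime() - start) / 1e9);
    } finally {
      scanner.close();
    }
  }
}

If those numbers look about the same as the Spark job's, the time is 
being spent server-side, and the iterator (or the raw read path) is the 
place to keep digging.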

> --
> Eugene Cheipesh
>
> From: Russ Weeks <rweeks@newbrightidea.com>
> Reply: user@accumulo.apache.org
> Date: January 9, 2015 at 11:32:13 PM
> To: user@accumulo.apache.org
> Subject: Re: Seeking Iterator
>
>> On Fri, Jan 9, 2015 at 7:56 PM, Christopher <ctubbsii@apache.org> wrote:
>>
>>     Another optimization you can try: instead of always seeking to the
>>     computed next key, you can advance internally inside your iterator
>>     by calling its source's next method a few times. If you don't reach
>>     the element you would have seek'd to within some reasonable number
>>     of iterations, you can then seek. This strategy is also hard to
>>     tune: do I need to advance, on average, 3 or 20 or 10000000 keys?
>>     How many before it would have been more efficient to just seek?
>>     There's no easy answer. Experimentation helps.
>>
>>
>> The VersioningIterator has a good example of this approach:
>> https://github.com/apache/accumulo/blob/901d60ef1cf72c2d55c90746fce94e108a992d3b/core/src/main/java/org/apache/accumulo/core/iterators/user/VersioningIterator.java#L95
>>
>> -Russ
>>
>>
>>
>>     --
>>     Christopher L Tubbs II
>>     http://gravatar.com/ctubbsii
>>
>>     On Fri, Jan 9, 2015 at 6:54 PM, Eugene Cheipesh
>>     <echeipesh@gmail.com> wrote:
>>
>>         That would work well enough and is my next choice.
>>
>>         The thought was, rows are stored in increasing order, so as
>>         long as I know when I’ve walked off the edge and can flag the
>>         iterator as empty, it’d be good. I’m just chasing the optimal
>>         in this case, but if it doesn’t exist, oh well.
>>
>>         Thank you for the reference link, it’s very helpful.
>>
>>         --
>>         Eugene Cheipesh
>>
>>         From: Russ Weeks <rweeks@newbrightidea.com>
>>         Reply: user@accumulo.apache.org
>>         Date: January 9, 2015 at 6:48:47 PM
>>         To: user@accumulo.apache.org
>>         Subject: Re: Seeking Iterator
>>
>>>         Hi, Eugene,
>>>
>>>         I think the conventional approach is to decompose your search
>>>         area (bounding box?) into a set of scan ranges that introduce
>>>         minimal extraneous curve segments, and then pass all those
>>>         scan ranges into a BatchScanner. The excellent Accumulo
>>>         Recipes site has an example[1]. Does this approach not work
>>>         for you?
>>>
>>>         In general, your custom iterator should never try to seek to
>>>         a row id different from the current row id, because that row
>>>         could be hosted by a different tablet server.
>>>
>>>         -Russ
>>>
>>>         1:
>>>         https://github.com/calrissian/accumulo-recipes/blob/master/store/geospatial-store/src/main/java/org/calrissian/accumulorecipes/geospatialstore/support/QuadTreeHelper.java#L33
>>>
>>>         On Fri, Jan 9, 2015 at 2:37 PM, Eugene Cheipesh
>>>         <echeipesh@gmail.com> wrote:
>>>
>>>             Hello,
>>>
>>>             I am attempting to write an Iterator based on a Z-curve
>>>             index to search through multi-dimensional data.
>>>             Essentially, given a record I have encountered that is
>>>             in the index range but not in the multi-dimensional query
>>>             range, I have a way to generate the next candidate
>>>             record, potentially far ahead of the current point.
>>>
>>>             Ideally I would be able to refine my search range with
>>>             subsequent calls to seek(). It appears that Accumulo will
>>>             create an iterator for every RFile (or some other split
>>>             point). The beginning of the range argument to seek will
>>>             be the record at the beginning of this split (which is
>>>             good), however all instances of the iterator have the
>>>             same, global range end (which is bad).
>>>
>>>             I need to avoid the case where I seek past the range
>>>             boundary of each individual iterator instance and throw a
>>>             NullPointerException. Is there any way to get enough
>>>             information to achieve this?
>>>
>>>             Thank you,
>>>
>>>             --
>>>             Eugene Cheipesh
>>>
>>>
>>
>>
