accumulo-user mailing list archives

From Russ Weeks
Subject Re: Using iterators to generate data
Date Tue, 02 Sep 2014 05:01:45 GMT
Hi, William,

Thanks very much for your response. I get that it's not supported or
desirable for an Iterator to instantiate a scanner or writer. It's sort of
analogous to opening a JDBC connection from inside a stored procedure -
lots of reasons why that would be a bad idea. I'm more interested in the
case where an iterator that processes input A, B, C, D might emit values A,
A1=f(A), B, B1=f(B) etc. Under what conditions is it safe to use iterators
this way? It seems there are at least two constraints: A1 must sort
lexicographically between A and B (otherwise the iterator could emit data
out of order), and A1 must be in the same row as A (otherwise A1 might
properly be handled by a different tablet server).
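
To make those two constraints concrete, here is a minimal sketch in plain Python. It is a toy model, not the Accumulo iterator API: `Key`, `derive` and `with_derived` are hypothetical names, keys are reduced to (row, family, qualifier) byte strings compared field by field, and timestamps and visibilities are ignored. The derived cell A1 is placed in A's row by reusing A's row and family, and the generator refuses to emit an A1 that would not sort before B.

```python
from typing import Callable, Iterable, Iterator, NamedTuple, Tuple


class Key(NamedTuple):
    """Simplified stand-in for a key; tuples compare field by field."""
    row: bytes
    family: bytes
    qualifier: bytes


Entry = Tuple[Key, str]


def derive(key: Key, value: str, f: Callable[[str], str]) -> Entry:
    """Build A1 = f(A) in the same row and family as A, sorting after A."""
    # A NUL byte followed by a suffix keeps A1 close behind A in byte order.
    # The suffix is an arbitrary choice made for this sketch.
    return Key(key.row, key.family, key.qualifier + b"\x00derived"), f(value)


def with_derived(entries: Iterable[Entry],
                 f: Callable[[str], str]) -> Iterator[Entry]:
    """Emit A, f(A), B, f(B), ... and fail if A1 would not sort before B."""
    pending = None  # derived entry held back until the next source key is seen
    for key, value in entries:
        if pending is not None:
            if not pending[0] < key:
                raise ValueError(
                    f"derived key {pending[0]} does not sort before {key}")
            yield pending
        yield key, value
        pending = derive(key, value, f)
    if pending is not None:
        yield pending
```

Feeding it two cells from one row with qualifiers `a` and `b` yields four entries in sorted order, all in that row. If the next source qualifier were `a\x00`, the derived key would sort after it and the sketch raises instead of emitting data out of order.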

It seems the consensus is to use MapReduce for this sort of thing. I'm
definitely keeping an eye on Fluo, though; it looks like a very cool project!


On Sat, Aug 30, 2014 at 12:20 AM, William Slacum wrote:

> This comes up a bit, so maybe we should add it to the FAQ (or just have
> better information about iterators in general). The short answer is that
> it's usually not recommended, because there aren't strong guarantees about
> the lifetime of an iterator (so we wouldn't know when to close any
> resources held by an iterator instance, such as batch writer thread pools)
> and there's no resource management for tablet server-to-tablet server
> communications.
> Check out Fluo, made by our own "Chief" Keith Turner & Mike "The Trike"
> Walch:
> It's an implementation of Google's Percolator, which provides the
> capability to handle "new" data server side as well as transactional
> guarantees.
> On Fri, Aug 29, 2014 at 5:09 PM, Russ Weeks wrote:
>> There are plenty of examples of using custom iterators to filter or
>> combine data at either the cell level or the row level. In these cases, the
>> amount of data coming out of the iterator is less than the amount going in.
>> What about going the other direction, using a custom iterator to generate
>> new data based on the contents of a cell or a row? I guess this is also
>> what a combiner does but bear with me...
>> The immediately obvious use case is parsing. Suppose one cell in my row
>> holds an XML document. I'd like to configure an iterator with an XPath
>> expression to pull a field out of the document, so that I can leverage the
>> distributed processing of the cluster instead of parsing the doc on the
>> scanner-side.
>> I'm sure there are constraints or things to watch out for, does anybody
>> have any recommendations here? For instance, the generated cells would
>> probably have to be in the same row as the input cells?
>> I'm using MapReduce to satisfy all these use cases right now but I'm
>> interested to know how much of my code could be ported to Iterators.
>> Thanks!
>> -Russ
