accumulo-user mailing list archives

From William Slacum <>
Subject Re: Using iterators to generate data
Date Sat, 30 Aug 2014 07:20:25 GMT
This comes up a bit, so maybe we should add it to the FAQ (or just have
better information about iterators in general). The short answer is that
it's usually not recommended, because there aren't strong guarantees about
the lifetime of an iterator (so we wouldn't know when to close any
resources held by an iterator instance, such as batch writer thread pools)
and there's no resource management related to tablet server-to-tablet server

Check out Fluo, made by our own "Chief" Keith Turner & Mike "The Trike"

It's an implementation of Google's Percolator, which provides the
capability to handle "new" data server side as well as transactional

On Fri, Aug 29, 2014 at 5:09 PM, Russ Weeks <>

> There are plenty of examples of using custom iterators to filter or
> combine data at either the cell level or the row level. In these cases, the
> amount of data coming out of the iterator is less than the amount going in.
> What about going the other direction, using a custom iterator to generate
> new data based on the contents of a cell or a row? I guess this is also
> what a combiner does but bear with me...
> The immediately obvious use case is parsing. Suppose one cell in my row
> holds an XML document. I'd like to configure an iterator with an XPath
> expression to pull a field out of the document, so that I can leverage the
> distributed processing of the cluster instead of parsing the doc on the
> scanner-side.
> I'm sure there are constraints or things to watch out for, does anybody
> have any recommendations here? For instance, the generated cells would
> probably have to be in the same row as the input cells?
> I'm using MapReduce to satisfy all these use cases right now but I'm
> interested to know how much of my code could be ported to Iterators.
> Thanks!
> -Russ
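
For reference, the per-cell work Russ describes (pulling one field out of an
XML value with an XPath expression) can be sketched with nothing but the JDK's
javax.xml.xpath classes. This is only an illustration and not Accumulo API: the
class name XPathFieldExtractor and its extract method are hypothetical, and the
sketch assumes each value holds one small, well-formed UTF-8 XML document. A
custom iterator could apply something like this to each value it returns,
holding no external resources that would need closing.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Hypothetical helper (not part of Accumulo): extracts one field from an
// XML document stored as the bytes of a cell value.
final class XPathFieldExtractor {
    // Compiled once so the expression is not re-parsed for every cell.
    private final XPathExpression expr;

    XPathFieldExtractor(String xpath) throws Exception {
        this.expr = XPathFactory.newInstance().newXPath().compile(xpath);
    }

    // Parses the value as XML and returns the string result of the XPath
    // expression as UTF-8 bytes (empty if nothing matches).
    byte[] extract(byte[] xmlValue) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlValue));
        String field = expr.evaluate(doc);
        return field.getBytes(StandardCharsets.UTF_8);
    }
}
```

Because the output here is derived from a single input value and replaces it,
the key can stay the same, which sidesteps the question of generating cells
outside the input row.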
