accumulo-user mailing list archives

From Keith Turner <ke...@deenlo.com>
Subject Re: Using iterators to generate data
Date Tue, 02 Sep 2014 14:09:53 GMT
Does extending org.apache.accumulo.core.iterators.user.TransformingIterator
meet your needs?

Transforming data is tricky.  You have covered some of the issues, such as
making sure you generate sorted data within the tablet's range.  You also
need to handle the reseek case.  Accumulo reads data in batches, and at any
point it could take the last key the iterator returned and reseek with a
non-inclusive start.  For example, suppose you initially seek [A,R] and your
iterator returns keys B, F, L, N, Q.  Your iterator should work correctly
with each of the following seek ranges: (B,R], (F,R], (L,R], (N,R], and (Q,R].
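
To make the reseek case concrete, here is a rough sketch in plain Java.  It
does not use the Accumulo API.  A TreeSet of strings stands in for the source
iterator, and the names ReseekSketch, derive, scan and widenSeek are made up
for illustration.  It assumes the generating pattern from the quoted message
below, where each source key K is followed by a derived key K1 that sorts
right after K.  The point it tries to show: if the reseek start is exclusive
at a source key, passing that range straight to the source skips the key, so
its derived key is never regenerated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class ReseekSketch {
    // Hypothetical derived key: "B" -> "B1". Assumed to sort after its
    // source key and before the next source key.
    static String derive(String k) {
        return k + "1";
    }

    // Emits every source key plus its derived key within
    // (startExclusive, endInclusive]. A null start means "from the beginning".
    // widenSeek=false passes the exclusive start straight to the source;
    // widenSeek=true backs up to the source key at or before the start.
    static List<String> scan(TreeSet<String> source, String startExclusive,
                             String endInclusive, boolean widenSeek) {
        NavigableSet<String> input = source;
        if (startExclusive != null) {
            String floor = widenSeek ? source.floor(startExclusive) : null;
            input = (floor != null)
                    ? source.tailSet(floor, true)            // re-read the generating key
                    : source.tailSet(startExclusive, false); // plain exclusive seek
        }
        List<String> out = new ArrayList<>();
        for (String k : input) {
            if (k.compareTo(endInclusive) > 0) {
                break; // past the end of the seek range
            }
            for (String emitted : new String[] {k, derive(k)}) {
                boolean afterStart = startExclusive == null
                        || emitted.compareTo(startExclusive) > 0;
                boolean beforeEnd = emitted.compareTo(endInclusive) <= 0;
                if (afterStart && beforeEnd) {
                    out.add(emitted); // only keys inside the requested range
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeSet<String> source = new TreeSet<>(Arrays.asList("B", "F", "L"));
        System.out.println("full scan:      " + scan(source, null, "R", true));
        System.out.println("naive (B,R]:    " + scan(source, "B", "R", false));
        System.out.println("widened (B,R]:  " + scan(source, "B", "R", true));
    }
}
```

In this toy model the naive reseek on (B,R] loses B1, while the widened one
returns everything after B.  A real iterator would have to do the equivalent
adjustment on the Range it passes to its source, and the details depend on
how your generated keys relate to the keys they come from.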




On Tue, Sep 2, 2014 at 1:01 AM, Russ Weeks <rweeks@newbrightidea.com> wrote:

> Hi, William,
>
> Thanks very much for your response. I get that it's not supported or
> desirable for an Iterator to instantiate a scanner or writer. It's sort of
> analogous to opening a JDBC connection from inside a stored procedure -
> lots of reasons why that would be a bad idea. I'm more interested in the
> case where an iterator that processes input A, B, C, D might emit values A,
> A1=f(A), B, B1=f(B) etc. Under what conditions is it safe to use iterators
> this way? It seems there are at least two constraints: A1 must sort
> lexicographically between A and B (otherwise the iterator could emit data
> out of order), and A1 must be in the same row as A (otherwise A1 might
> properly be handled by a different tablet server).
>
> Seems like the consensus is to use MR for this sort of thing. I'm
> definitely keeping an eye on fluo though, looks like a very cool project!
>
> -Russ
>
>
> On Sat, Aug 30, 2014 at 12:20 AM, William Slacum <
> wilhelm.von.cloud@accumulo.net> wrote:
>
>> This comes up a bit, so maybe we should add it to the FAQ (or just have
>> better information about iterators in general). The short answer is that
>> it's usually not recommended, because there aren't strong guarantees about
>> the lifetime of an iterator (so we wouldn't know when to close any
>> resources held by an iterator instance, such as batch writer thread pools)
>> and there's 0 resource management related to tablet server-to-tablet server
>> communications.
>>
>> Check out Fluo, made by our own "Chief" Keith Turner & Mike "The Trike"
>> Walch: https://github.com/fluo-io/fluo
>>
>> It's an implementation of Google's Percolator, which provides the
>> capability to handle "new" data server side as well as transactional
>> guarantees.
>>
>>
>> On Fri, Aug 29, 2014 at 5:09 PM, Russ Weeks <rweeks@newbrightidea.com>
>> wrote:
>>
>>> There are plenty of examples of using custom iterators to filter or
>>> combine data at either the cell level or the row level. In these cases, the
>>> amount of data coming out of the iterator is less than the amount going in.
>>> What about going the other direction, using a custom iterator to generate
>>> new data based on the contents of a cell or a row? I guess this is also
>>> what a combiner does but bear with me...
>>>
>>> The immediately obvious use case is parsing. Suppose one cell in my row
>>> holds an XML document. I'd like to configure an iterator with an XPath
>>> expression to pull a field out of the document, so that I can leverage the
>>> distributed processing of the cluster instead of parsing the doc on the
>>> scanner-side.
>>>
>>> I'm sure there are constraints or things to watch out for, does anybody
>>> have any recommendations here? For instance, the generated cells would
>>> probably have to be in the same row as the input cells?
>>>
>>> I'm using MapReduce to satisfy all these use cases right now but I'm
>>> interested to know how much of my code could be ported to Iterators.
>>>
>>> Thanks!
>>> -Russ
>>>
>>
>>
>
