accumulo-user mailing list archives

From William Slacum <wilhelm.von.cl...@accumulo.net>
Subject Re: Using iterators to generate data
Date Tue, 02 Sep 2014 14:03:48 GMT
Ah, I see. You're correct about the ordering. How different would your key
be? Another thing to consider: if you return a generated key that's not
actually in the data, your iterator needs to handle being re-seeked with a
range whose exclusive start is that generated key. If you return multiple
generated keys, you may have to recompute results in order to resume
after the last key you returned.
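
To make the re-seek case concrete, here is a minimal sketch in plain Python. It
is a model of the idea only and does not use the Accumulo iterator API. The
names SEP, SUFFIX, derive, source_key_of and scan are all hypothetical, and it
assumes that no real key contains the separator byte, so a generated key always
sorts directly after the key it was derived from.

```python
from bisect import bisect_left

SEP = "\x00"          # assumption: never appears in a real (source) key
SUFFIX = SEP + "gen"  # hypothetical marker for generated keys


def derive(value):
    """Stand-in for f(A); any deterministic function works."""
    return value.upper()


def source_key_of(key):
    """Map a source or generated key back to the source key it came from."""
    return key.partition(SEP)[0]


def scan(source, start=None, start_inclusive=True):
    """Yield (key, value) pairs in sorted order, beginning at `start`.

    `source` is a list of (key, value) pairs sorted by key.
    """
    keys = [k for k, _ in source]
    # Rewind to the source key behind `start`, even if `start` was generated.
    i = 0 if start is None else bisect_left(keys, source_key_of(start))
    for key, value in source[i:]:
        # Recompute the generated entry each time; nothing is cached across seeks.
        for out_key, out_value in ((key, value), (key + SUFFIX, derive(value))):
            if start is None or out_key > start or (start_inclusive and out_key == start):
                yield out_key, out_value
```

With source = [("a", "x"), ("b", "y")], resuming with an exclusive start of
"a" + SUFFIX skips both "a" and its generated key and continues at "b".
Resuming with an exclusive start of "a" re-reads "a" from the source and
recomputes its generated entry, which is the recomputation cost mentioned
above.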


On Tue, Sep 2, 2014 at 1:01 AM, Russ Weeks <rweeks@newbrightidea.com> wrote:

> Hi, William,
>
> Thanks very much for your response. I get that it's not supported or
> desirable for an Iterator to instantiate a scanner or writer. It's sort of
> analogous to opening a JDBC connection from inside a stored procedure -
> lots of reasons why that would be a bad idea. I'm more interested in the
> case where an iterator that processes input A, B, C, D might emit values A,
> A1=f(A), B, B1=f(B) etc. Under what conditions is it safe to use iterators
> this way? It seems there are at least two constraints: A1 must sort
> lexicographically between A and B (otherwise the iterator could emit data
> out of order), and A1 must be in the same row as A (otherwise A1 might
> properly be handled by a different tablet server).
>
> Seems like the consensus is to use MR for this sort of thing. I'm
> definitely keeping an eye on Fluo though, looks like a very cool project!
>
> -Russ
>
>
> On Sat, Aug 30, 2014 at 12:20 AM, William Slacum <
> wilhelm.von.cloud@accumulo.net> wrote:
>
>> This comes up a bit, so maybe we should add it to the FAQ (or just have
>> better information about iterators in general). The short answer is that
>> it's usually not recommended, because there aren't strong guarantees about
>> the lifetime of an iterator (so we wouldn't know when to close any
>> resources held by an iterator instance, such as batch writer thread pools)
>> and there's no resource management related to tablet server-to-tablet server
>> communications.
>>
>> Check out Fluo, made by our own "Chief" Keith Turner & Mike "The Trike"
>> Walch: https://github.com/fluo-io/fluo
>>
>> It's an implementation of Google's Percolator, which provides the
>> capability to handle "new" data server side as well as transactional
>> guarantees.
>>
>>
>> On Fri, Aug 29, 2014 at 5:09 PM, Russ Weeks <rweeks@newbrightidea.com>
>> wrote:
>>
>>> There are plenty of examples of using custom iterators to filter or
>>> combine data at either the cell level or the row level. In these cases, the
>>> amount of data coming out of the iterator is less than the amount going in.
>>> What about going the other direction, using a custom iterator to generate
>>> new data based on the contents of a cell or a row? I guess this is also
>>> what a combiner does but bear with me...
>>>
>>> The immediately obvious use case is parsing. Suppose one cell in my row
>>> holds an XML document. I'd like to configure an iterator with an XPath
>>> expression to pull a field out of the document, so that I can leverage the
>>> distributed processing of the cluster instead of parsing the doc on the
>>> scanner-side.
>>>
>>> I'm sure there are constraints or things to watch out for, does anybody
>>> have any recommendations here? For instance, the generated cells would
>>> probably have to be in the same row as the input cells?
>>>
>>> I'm using MapReduce to satisfy all these use cases right now but I'm
>>> interested to know how much of my code could be ported to Iterators.
>>>
>>> Thanks!
>>> -Russ
>>>
>>
>>
>
