accumulo-user mailing list archives

From Dylan Hutchison <dhutc...@cs.washington.edu>
Subject Re: Write data to a file from inside of an iterator
Date Sat, 05 Nov 2016 18:02:51 GMT
Ah, this is the use case for Graphulo <http://graphulo.mit.edu/>'s OneTable
<https://github.com/Accla/graphulo/blob/master/src/main/java/edu/mit/ll/graphulo/Graphulo.java#L807>
call.
Internally the OneTable call sets up a special iterator
(RemoteWriteIterator) that does open a BatchWriter.  The main trick that
allows it to write entries safely is pushing row/column filters into the
iterator, so that the iterator controls re-seeking rather than Accumulo.
This allows the iterator to write all its entries and close() without
having to worry about Accumulo tearing it down.  See the docs
<https://github.com/Accla/graphulo/blob/master/docs/START_HERE_2016-03-28-Graphulo-UseDesign.pdf>
for a starter.
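
To illustrate the idea only, here is a minimal plain-Java sketch.  It does
not use the Accumulo or Graphulo APIs: SinkWriter and WriteAllIterator are
hypothetical stand-ins I made up for a BatchWriter and for an iterator that
owns its row range, and the sorted map stands in for a tablet's entries.
The point it shows is that when the iterator holds the range itself, it can
walk every matching entry in one pass, write each one, and close the writer
before returning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class RemoteWriteSketch {

    /** Hypothetical stand-in for a BatchWriter: buffers entries until close(). */
    static class SinkWriter implements AutoCloseable {
        final List<String> buffer = new ArrayList<>();
        final List<String> flushed = new ArrayList<>();
        boolean closed = false;

        void add(String row, String value) {
            if (closed) {
                throw new IllegalStateException("writer already closed");
            }
            buffer.add(row + "=" + value);
        }

        @Override
        public void close() {
            // Flush everything that was buffered, then mark the writer closed.
            flushed.addAll(buffer);
            buffer.clear();
            closed = true;
        }
    }

    /** Hypothetical iterator that holds its own row range (the "pushed-down" filter). */
    static class WriteAllIterator {
        private final SortedMap<String, String> source;
        private final String startRow; // inclusive
        private final String endRow;   // exclusive

        WriteAllIterator(SortedMap<String, String> source, String startRow, String endRow) {
            this.source = source;
            this.startRow = startRow;
            this.endRow = endRow;
        }

        /** Writes every entry in the range, closes the writer, and returns the count. */
        int writeAll(SinkWriter writer) {
            int count = 0;
            // try-with-resources guarantees close() runs even if add() throws.
            try (SinkWriter w = writer) {
                for (Map.Entry<String, String> e : source.subMap(startRow, endRow).entrySet()) {
                    w.add(e.getKey(), e.getValue());
                    count++;
                }
            }
            return count;
        }
    }

    /** Runs the sketch on four sample rows and returns what the writer flushed. */
    static List<String> run() {
        SortedMap<String, String> table = new TreeMap<>();
        table.put("row1", "a");
        table.put("row2", "b");
        table.put("row3", "c");
        table.put("row4", "d");

        SinkWriter writer = new SinkWriter();
        new WriteAllIterator(table, "row2", "row4").writeAll(writer);
        return writer.flushed;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The real RemoteWriteIterator is considerably more involved, so treat the
above as a picture of the control flow rather than as Graphulo's
implementation.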

*cue Josh to warn against the evils of re-purposing tablet servers for
MapReduce cycles* =)

Really, this is advanced stuff.  Graphulo's iterators were shown to scale
up to 16 nodes for matrix multiply at the last HPEC conference, but your
use case could still break Accumulo, in the worst case causing deadlock, if
you don't use them properly.  You're also free to write your own code using
Graphulo's code as a starting point, if you're more comfortable with that.
You may also decide on another approach, such as launching a MapReduce job
against Accumulo's RFiles, which could be better or worse depending on your
use case.

On Sat, Nov 5, 2016 at 10:28 AM, Yamini Joshi <yamini.1691@gmail.com> wrote:

> Hello all
>
> As per https://github.com/apache/accumulo/blob/master/docs/src/main/asciidoc/chapters/iterator_design.txt
> "
> Implementations of Iterator might be tempted to open BatchWriters inside
> of an Iterator as a means
> to implement triggers for writing additional data outside of their client
> application. The lifecycle of an Iterator
> is *not* managed in such a way that guarantees that this is safe nor
> efficient. Specifically, there
> is no way to guarantee that the internal ThreadPool inside of the
> BatchWriter is closed (and the thread(s)
> are reaped) without calling the close() method. `close`'ing and recreating
> a `BatchWriter` after every
> Key-Value pair is also prohibitively performance limiting to be considered
> an option."
>
> If I need to write a subset of records generated from an iterator to a
> file/table, I can't use a batch writer inside of an iterator? Is there any
> other way to go about it?
>
> Best regards,
> Yamini Joshi
>
