accumulo-user mailing list archives

From Peter Tillotson <>
Subject Re: Iterators - updating other rows
Date Mon, 15 Jul 2013 13:31:59 GMT
Reading the paper and looking at your implementation, this is certainly in the ballpark
I am striving for. The way I think of it is that each "spreadsheet cell" should look after itself;
this is called a data flow architecture in some of the older literature.

My current implementation uses Iterators, and my data is split over several column qualifiers,
which I know will be processed in order. Actions on the current column depend on the state
of previous columns. What I'm trying to avoid are disk seeks - if I can fold updates in during
compaction I can reduce wasted operations.

I've effectively got context for the Observer. 

Tnx for the deadlock scenarios - I'm pretty certain it is Situation 1.


 From: Keith Turner <>
To:; Peter Tillotson <> 
Sent: Monday, 15 July 2013, 13:49
Subject: Re: Iterators - updating other rows

On Mon, Jul 15, 2013 at 6:38 AM, Peter Tillotson <> wrote:

I've got two tables of dependent data, which I was hoping to update efficiently during compaction.
This leads to the following requirements:
>  - Changes to other rows
>  - Changes in other tables
>I've fought with iterators and embedding writers, but have had to fall back to map reduce
jobs to complete the update. 
>Is there a recommended approach to this?

Writing to Accumulo from an iterator can lead to deadlock.  I can think of at least the following
two situations, but there are probably more.

Situation 1 

 1. Memory is full on tablet server 1 and writes are held
 2. Tablet X is on Tserver 1 and is scheduled for compaction to free memory
 3. Tablet X tries to write to Tablet server 1, but the writes block because memory is full
 4. No other tablet on Tserver 1 can be written to because memory is full and can not be
     freed, so the problem snowballs

Situation 2

 1. Tserver 2 is hosting Tablet Y & Z
 2. Tablet Y & Z have data in memory
 3. Tserver 2 dies
 4. Tserver 3 loads Tablet Y, recovers its data, and tries to compact
 5. Tablet Y tries to write to Tablet Z during compaction
 6. Tserver 4 loads Tablet Z, recovers its data, and tries to compact
 7. Tablet Z tries to write to Tablet Y during compaction
 8. Tablets Y & Z are not loaded yet, but each is trying to write to the other (deadlock)
 9. Tablet servers 3 and 4 can not load any more tablets, because their load threads are
     both stuck, so the problem snowballs
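
Both situations above are circular waits. As a rough illustration only, the sketch below models each one as a wait-for graph and looks for a cycle. It is a toy model, not Accumulo code: the node names and the `find_cycle` helper are made up for this example, and each node is assumed to wait on at most one other thing.

```python
# Toy model of the two situations as wait-for graphs.
# Nothing here touches Accumulo; all names are hypothetical labels.

def find_cycle(waits_for):
    """Return one cycle as a list of nodes, or None if there is none.

    waits_for maps each node to the single node it is blocked on.
    """
    for start in waits_for:
        path, seen = [start], {start}
        node = waits_for.get(start)
        while node is not None:
            if node in seen:
                # Walked back onto the current path: that suffix is a cycle.
                return path[path.index(node):]
            path.append(node)
            seen.add(node)
            node = waits_for.get(node)
    return None

# Situation 1: the compaction that should free memory blocks on its own write.
situation_1 = {
    "write issued by compaction of X": "free memory on Tserver 1",
    "free memory on Tserver 1": "compaction of X",
    "compaction of X": "write issued by compaction of X",
}

# Situation 2: each tablet's load waits on a compaction that writes to the other.
situation_2 = {
    "Tablet Y loaded": "compaction of Y",
    "compaction of Y": "Tablet Z loaded",
    "Tablet Z loaded": "compaction of Z",
    "compaction of Z": "Tablet Y loaded",
}

# A write path that ends somewhere (no cycle), for contrast.
no_deadlock = {"client write": "Tablet Q loaded"}
```

Calling `find_cycle` on either situation returns every node in that graph, while `find_cycle(no_deadlock)` returns `None`.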

I am currently working on an implementation of Percolator[1].  Not something you can use
now, but I am curious if you could use Percolator to solve your problem?  I am very interested
in feedback on this project while it's in its formative stages.  I hope to have it finished
w/ Accumulo 1.6.0.


>A bit more detail about the algorithm.
>I've two tables with different sort orders, and I use ngram row ids to group elements and
split them over multiple tablets, so:
>nm: key1: 000: newValueId2
>nm: key2: type: valueId1
>nm: key3: type: valueId1
>ab: valueId1: 001: blob
>ab: valueId1:key2: nm
>Multiple keys point to the same value in the other table, but both keys and values are
liable to change. What I was trying to do was use special columns (column qualifier 000
above), which I call care-of columns, to do redirects as data changes in real time. With
iterators this would become eventually consistent and be very efficient, but a MapReduce
approach requires multiple scans of each large table. I like the approach because the ngram
splits / groups data and the two different sorts give me different nice query characteristics.
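
To make the care-of idea concrete, here is a minimal sketch using plain dictionaries in place of the `nm` table. It assumes, and this is only a reading of the example rows above, that a `000` column means "use this value id instead", and it relies on `000` sorting before the other qualifiers so a single in-order pass sees the redirect first. The `resolve_value_id` helper is hypothetical.

```python
# Toy stand-in for the "nm" table: {row: {column_qualifier: value}}.
# Rows are copied from the example above; the redirect semantics are assumed.
CARE_OF = "000"

nm = {
    "key1": {"000": "newValueId2"},
    "key2": {"type": "valueId1"},
    "key3": {"type": "valueId1"},
}

def resolve_value_id(columns):
    """Walk one row's columns in sorted order and return the value id to use."""
    redirect = None
    for qualifier in sorted(columns):  # mimics sorted key order within a row
        value = columns[qualifier]
        if qualifier == CARE_OF:
            redirect = value           # state carried forward to later columns
        elif redirect is None:
            return value               # no redirect seen, use the column as-is
    return redirect
```

With these rows, `resolve_value_id(nm["key1"])` follows the care-of column to `newValueId2`, while `key2` and `key3` still resolve to `valueId1`.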
>For some reason the embedded writers were blocking - I may retry with a larger cluster.
I fought with it for a few days then resorted to MapReduce jobs until I get a chance to look
at the Accumulo code more closely. 
>Would it be easy to add a special iterator that accepts (Text, Mutation) pairs, much as
the AccumuloOutputFormat does?
>Many thanks in advance
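
The (Text, Mutation) question above can be pictured with a small toy model. This is not an existing Accumulo feature and uses no Accumulo classes: `compact`, `apply_pending`, and the dictionary-shaped mutation are all hypothetical. The sketch only shows the shape of the idea, where the compaction pass queues (table name, mutation) pairs and something outside the pass applies them later. The back-pointer it writes mirrors the `ab: valueId1:key2: nm` row from the example, which is an assumption about what the update would be.

```python
from collections import deque

def compact(entries, pending):
    """Pass entries through unchanged; queue side effects instead of writing."""
    for row, qualifier, value in entries:
        if qualifier == "000":  # care-of column, as in the example rows
            # (table name, mutation) pair; the mutation shape is made up.
            pending.append(
                ("ab", {"row": value, "qualifier": row, "value": "nm"})
            )
        yield row, qualifier, value

def apply_pending(pending, tables):
    """Apply queued mutations to toy tables, outside the compaction pass."""
    while pending:
        table, m = pending.popleft()
        rows = tables.setdefault(table, {})
        rows.setdefault(m["row"], {})[m["qualifier"]] = m["value"]

# Example run over two rows taken from the "nm" example.
entries = [("key1", "000", "newValueId2"), ("key2", "type", "valueId1")]
pending = deque()
compacted = list(compact(entries, pending))

tables = {}
apply_pending(pending, tables)
```

Because the compaction pass itself never blocks on a write in this model, it sidesteps the circular wait described earlier in the thread. How and where the queued pairs would be stored and applied in a real system is left open here.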