accumulo-user mailing list archives

From ameet kini <>
Subject Re: Using Accumulo as input to a MapReduce job frequently hangs due to lost Zookeeper connection
Date Wed, 17 Oct 2012 14:10:07 GMT
Turns out that my assumption of tables being partitioned the same way may
be too restrictive. I need to account for join partitions not being
co-located on the same tablet server, so CompositeInputFormat is not
applicable as I'd initially thought. That said, I hadn't gotten very far
with it; in particular, I couldn't for the life of me figure out how to
configure the mapred.join.expr to work on Accumulo's rfile directory.
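For context, when CompositeInputFormat is used over plain HDFS files, mapred.join.expr is normally built with CompositeInputFormat.compose() and ends up as a nested expression like the following (the input format class and paths here are illustrative, not from my setup):

```
mapred.join.expr = inner(
    tbl(org.apache.hadoop.mapred.SequenceFileInputFormat, "/hdfs/path/tableA"),
    tbl(org.apache.hadoop.mapred.SequenceFileInputFormat, "/hdfs/path/tableB"))
```

Each tbl() entry names an input format plus a path, and the sticking point is that there is no obvious tbl() entry that maps onto an Accumulo table's rfile directories while respecting tablet boundaries.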

I ended up extending AccumuloInputFormat to do the join. The record reader
would read table A using AccumuloInputFormat's scannerIterator and issue
BatchScanner lookups to get table B's matching records, similar to Keith's
suggestion above.
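A minimal sketch of that lookup-join logic, with in-memory TreeMaps standing in for the Accumulo scans (the class and method names are mine, not Accumulo's): in the real record reader, the outer loop is the AccumuloInputFormat scanner over table A, and each flush is a BatchScanner over table B with one Range per buffered row id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Lookup join: stream table A in sorted order, buffer row ids, and fetch
// the matching rows from table B one batch at a time. The TreeMaps are
// stand-ins for the sorted Accumulo tables.
public class LookupJoin {
    public static List<String> join(TreeMap<String, String> tableA,
                                    TreeMap<String, String> tableB,
                                    int batchSize) {
        List<String> joined = new ArrayList<>();
        SortedSet<String> batch = new TreeSet<>();
        for (Map.Entry<String, String> e : tableA.entrySet()) {
            batch.add(e.getKey());
            if (batch.size() == batchSize) {
                flush(batch, tableA, tableB, joined);
            }
        }
        flush(batch, tableA, tableB, joined);  // drain the final partial batch
        return joined;
    }

    private static void flush(SortedSet<String> batch,
                              TreeMap<String, String> tableA,
                              TreeMap<String, String> tableB,
                              List<String> out) {
        // One "batch scan" of table B covering every buffered row id.
        for (String rowId : batch) {
            String bVal = tableB.get(rowId);
            if (bVal != null) {  // inner-join semantics: skip unmatched rows
                out.add(rowId + ":" + tableA.get(rowId) + "," + bVal);
            }
        }
        batch.clear();
    }
}
```

Batching the lookups is what makes the BatchScanner a good fit here: it amortizes the round trips to table B's tablet servers instead of doing one scan per row of table A.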


That said, I had spent some time trying to configure the join with
AccumuloInputFormat, and couldn't get very far because I couldn't figure
out how to write a mapred.join.expr that would work directly on the
underlying rfiles in Accumulo, even after flushing/compacting the table so
that I ended up with exactly one rfile per tablet.

On Thu, Oct 11, 2012 at 2:57 PM, Billie Rinaldi <> wrote:

> On Wed, Oct 10, 2012 at 7:22 AM, ameet kini <> wrote:
>> I have a related problem where I need to do a 1-1 join (every row in
>> table A joins with a unique row in table B and vice versa). My join
>> key is the row id of the table. In the past, I've used Hadoop's
>> CompositeInputFormat to do a map-side join over data in HDFS
>> (described here). My
>> tables in Accumulo seem to fit the eligibility criteria of
>> CompositeInputFormat: both tables are sorted by the join key, since
>> the join key is the row id in my case, and the tables are partitioned
>> the same way (i.e., same split points).
>> Has anyone tried using CompositeInputFormat over Accumulo tables? Is
>> it possible to configure CompositeInputFormat with
>> AccumuloInputFormat?
> I haven't tried it.  If you do, let us know how it works out.
> Billie
>> Thanks,
>> Ameet
>> On Tue, Aug 21, 2012 at 8:23 AM, Keith Turner <> wrote:
>> > Yeah, that would certainly work.
>> >
>> > You could run two map-only jobs (they could run concurrently): one that
>> > reads D1 and writes to Table3, and one that reads D2 and writes to
>> > Table3.  A map-reduce join may be faster, unless you want the final
>> > result in Accumulo, in which case this approach may be faster.  The two
>> > jobs could also produce files to bulk import into Table3.
>> >
>> > Keith
>> >
>> > On Mon, Aug 20, 2012 at 8:26 PM, David Medinets
>> > <> wrote:
>> >> Can you use a new table to join and then scan the new table? Use the
>> >> foreign key as the rowid. Basically create your own materialized view.
