accumulo-user mailing list archives

From Keith Turner <>
Subject Re: joining accumulo tables with mapreduce
Date Wed, 17 Apr 2013 14:59:08 GMT
If I am understanding you correctly, you are proposing that, for each row
a mapper gets, it looks that row up in two other tables?  This would
result in a lot of little round-trip RPC calls and random disk accesses.

I think a better solution would be to read all three tables into your
mappers, and do the join in the reduce.  This solution will avoid all
of the little RPC calls and do lots of sequential I/O instead of
random accesses.  Between the map and reduce, you could track which
table each row came from.  Any filtering could be done in the mapper
or by iterators.  Unfortunately Accumulo does not have the needed
input format for this out of the box.  There is a ticket,
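
To make the reduce-side join above concrete, here is a minimal sketch in
plain Java. It only simulates the map, shuffle, and reduce steps in memory;
it does not use any Hadoop or Accumulo classes, and the class, method, and
field names (ReduceSideJoinSketch, TaggedEntry, join) are made up for
illustration. The filter values ("abc", "xyz", "def") and the output layout
are taken from the pseudocode in the quoted message below.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of a reduce-side join; no Hadoop or Accumulo APIs involved.
class ReduceSideJoinSketch {

    // One key/value entry tagged with the table it was read from (hypothetical type).
    static final class TaggedEntry {
        final String table, row, family, qualifier, value;

        TaggedEntry(String table, String row, String family, String qualifier, String value) {
            this.table = table;
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    // "Map" step: filter the entry, then emit it under its row ID with the table tag intact.
    static void map(TaggedEntry e, Map<String, List<TaggedEntry>> shuffle) {
        boolean wantedFromA = e.table.equals("tableA")
                && e.family.equals("abc") && e.qualifier.equals("xyz");
        boolean wantedFromOther = !e.table.equals("tableA") && e.family.equals("def");
        if (wantedFromA || wantedFromOther) {
            shuffle.computeIfAbsent(e.row, k -> new ArrayList<>()).add(e);
        }
    }

    // "Reduce" step: every entry for one row ID arrives together, so the join happens here.
    static List<String> reduce(String row, List<TaggedEntry> entries) {
        List<String> out = new ArrayList<>();
        String valueFromA = null;
        for (TaggedEntry e : entries) {
            if (e.table.equals("tableA")) {
                valueFromA = e.value;
            }
        }
        if (valueFromA == null) {
            return out; // nothing matched in tableA for this row, so there is nothing to join
        }
        for (TaggedEntry e : entries) {
            if (!e.table.equals("tableA")) {
                // rowid, value_as_family, tablename_as_qualifier, entry_as_value_string
                out.add(String.join(",", row, valueFromA, e.table, e.value));
            }
        }
        return out;
    }

    // Drives map, a sorted "shuffle" grouped by row ID, and reduce over all input entries.
    static List<String> join(List<TaggedEntry> allEntries) {
        Map<String, List<TaggedEntry>> shuffle = new TreeMap<>();
        for (TaggedEntry e : allEntries) {
            map(e, shuffle);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<TaggedEntry>> group : shuffle.entrySet()) {
            out.addAll(reduce(group.getKey(), group.getValue()));
        }
        return out;
    }
}
```

In a real job the TreeMap would be replaced by the MapReduce shuffle, with
the row ID as the map output key and the table tag carried in the map output
value, so that each table is read once, sequentially.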

On Tue, Apr 16, 2013 at 5:28 PM, Aji Janis <> wrote:
> Hello,
>  I am interested in learning what the best solution/practices might be to
> join 3 accumulo tables by running a map reduce job. Interested in getting
> feedback on best practices and such. Here's pseudocode for what I want to
> accomplish:
> AccumuloInputFormat accepts tableA
> Global variable <table_list> has table names: tableB, tableC
> In a mapper, for example, you would do something like this:
> for each row in TableA {
>   if ( == "abc" && row.qualifier == "xyz") value = getValue()
>   if (foundvalue) {
>     for each table in table_list
>       scan table with (this rowid && family = "def")
>       for each entry found in scan
>         write to final_table (rowid, value_as_family, tablename_as_qualifier,
>                               entry_as_value_string)
>   }//end if foundvalue
> }//end for loop
> This is a simple version of what I want to do. In my non mapreduce java code
> I would do this by using a different scanner for each table in the list.
> Couple questions:
> - how bad/good is performance when using scanners within mappers?
> - if I get one mapper per range in tableA, do I reset scanners? how? or
> would I set up a scanner in the setup() of mapper ? --> i have no clue how
> this will play out so thinking out loud here.
> - any optimization suggestions? or examples of creating join_tables/indexes
> out there that I can refer to?
> Thank you for all suggestions.
