accumulo-user mailing list archives

From "Cornish, Duane C." <>
Subject Accumulo Map Reduce is not distributed
Date Fri, 02 Nov 2012 20:53:46 GMT

I apologize if this discussion should be directed to a Hadoop MapReduce forum; however, I have
some concern that my problem may be with my use of Accumulo.

I have a MapReduce job that I want to run over data in a table.  I have an index table and
a support table which contains a subset of the data in the index table.  I would like to
MapReduce over the support table on my small 4-node cluster.

I have written a MapReduce job that uses the AccumuloRowInputFormat class and sets the support
table as its input table.

In my mapper, I read in a row of the support table and call a static function which pulls
information out of the index table.  Next, I use the data returned from that call as input
to an external .so library that is stored on the name node.  I then make another static
function call to ingest the new data back into the index table.  (I know I could emit this
in the reduce step, but what I'm ingesting is a somewhat complex Java object, and I already
had a static function that ingests it the way I need.)  My reduce step is completely empty.

I print progress statements from my mapper.  The problem is that my entire job appears to
run sequentially, not in parallel.  I am running it from the Accumulo master on the 4-node
system.

I realized that my support table is very small and was not being split across any tablets,
so I am now presplitting it across all 4 nodes.  Now, when I run the MapReduce job, it
appears that 4 separate map tasks run one after another: the first runs to 100%, then the
next runs, and so on.  The job is only submitted once; why are there 4 sequential runs,
and why won't they run in parallel?
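For reference, presplitting into N tablets just means supplying N-1 split points before the job runs. Here is a minimal, self-contained sketch of how I compute evenly spaced split points; it assumes row IDs start with a uniformly distributed two-digit hex byte (an assumption about my key format), and the class and method names are my own, not Accumulo API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: compute N-1 evenly spaced split points for a table
// whose row IDs begin with a uniformly distributed two-digit hex byte.
public class SplitPoints {
    public static List<String> hexSplits(int tablets) {
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < tablets; i++) {
            // i-th boundary of [0x00, 0x100) divided into `tablets` buckets
            splits.add(String.format("%02x", i * 256 / tablets));
        }
        return splits;
    }

    public static void main(String[] args) {
        // 4 tablets -> 3 split points
        System.out.println(hexSplits(4)); // prints [40, 80, c0]
    }
}
```

I then feed these strings to the table's addSplits operation (or the shell's addsplits command) before submitting the job.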

Is there any way to set the number of tasks that can run?  This is possible from the Hadoop
command line; is it possible from the Java API?  Also, could my problem stem from the fact
that my mapper makes static function calls to another class in my Java project, accesses my
Accumulo index table, or calls an external .so library?  I could restructure the job to
avoid the static function calls, and I could write directly to the Accumulo table from my
MapReduce job if that would fix the problem, but I can't avoid the external .so library
call.  Any help would be greatly appreciated.
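My working mental model, which may well be wrong, is that AccumuloInputFormat creates one input split per tablet overlapping the scanned range, so the mapper count is bounded by the tablet count rather than by mapred.map.tasks. A toy model of that counting (purely illustrative, not real Accumulo code; the class and method names are made up):

```java
import java.util.List;

// Illustrative model only: one input split per tablet that the scan range
// touches. Tablets are the intervals between sorted split points:
// (-inf, s1), [s1, s2), ..., [s_last, +inf).
public class SplitModel {
    public static int overlappingTablets(List<String> splits, String startRow, String endRow) {
        int count = 0;
        String lo = null; // -inf
        for (int i = 0; i <= splits.size(); i++) {
            String hi = (i < splits.size()) ? splits.get(i) : null; // +inf
            // The tablet [lo, hi) overlaps [startRow, endRow] when the scan
            // starts before hi and ends at or after lo.
            boolean startsBeforeHi = (hi == null) || startRow.compareTo(hi) < 0;
            boolean endsAfterLo = (lo == null) || endRow.compareTo(lo) >= 0;
            if (startsBeforeHi && endsAfterLo) count++;
            lo = hi;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> splits = List.of("40", "80", "c0"); // 4 tablets
        System.out.println(overlappingTablets(splits, "00", "ff")); // full scan: 4
        System.out.println(overlappingTablets(splits, "50", "7f")); // inside one tablet: 1
    }
}
```

If that model is right, an unsplit table would explain a single mapper, though it would not explain why 4 presplit tablets run one after another.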

