accumulo-user mailing list archives

From Jim Klucar <>
Subject Re: EXTERNAL: Re: Failing Tablet Servers
Date Thu, 20 Sep 2012 22:56:30 GMT
I don't have the code in front of me, but if you look at the Mutation source there
are some memory constants in there for its internal buffers. How many puts you
can do depends on your key/value sizes. I believe the buffer sizes are
configurable.
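A back-of-envelope sketch of the point above: how many puts fit in a Mutation depends on the serialized size of each entry. This is not Accumulo code; the per-entry overhead and the 100 MB budget below are assumptions for illustration, so measure your own key/value sizes.

```java
// Sketch (assumed sizes, not Accumulo internals): estimate how many puts a
// single Mutation can hold before its in-memory buffer grows too large.
public class MutationBudget {
    public static long estimateBytesPerPut(int familyLen, int qualifierLen,
                                           int visibilityLen, int valueLen) {
        // Each put serializes the column family, qualifier, visibility,
        // an 8-byte timestamp, and the value into the mutation's buffer,
        // plus a few bytes of length prefixes (rough overhead estimate).
        int overhead = 20; // assumed framing/length overhead per entry
        return familyLen + qualifierLen + visibilityLen + 8 + valueLen + overhead;
    }

    public static long putsThatFit(long budgetBytes, long bytesPerPut) {
        return budgetBytes / bytesPerPut;
    }

    public static void main(String[] args) {
        long perPut = estimateBytesPerPut(10, 20, 0, 50); // ~108 bytes per put
        // With a hypothetical 100 MB budget per mutation, roughly how many fit?
        System.out.println(putsThatFit(100L * 1024 * 1024, perPut));
    }
}
```

With these assumed sizes, a single mutation holding a million puts is already on the order of 100 MB, which is why flushing periodically matters.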

Sent from my iPhone

On Sep 20, 2012, at 6:51 PM, "Cardon, Tejay E" <> wrote:

  Sorry, yes it’s the AccumuloOutputFormat.  I do about 1,000,000
Mutation.put calls before I do a context.write.  Any idea how many is safe?



*From:* Jim Klucar []
*Sent:* Thursday, September 20, 2012 4:44 PM
*Subject:* Re: EXTERNAL: Re: Failing Tablet Servers

Do you mean AccumuloOutputFormat? Is the map failing or the reduce failing?
How many Mutation.put calls are you doing before a context.write? Too many
puts will blow up the Mutation object's buffer. You need to periodically call
context.write and create a new Mutation object. At some point I wrote a
ContextFlushingMutation that handled this problem for you, but I'd have to
dig around for it or rewrite it.
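The "flush periodically and start a fresh Mutation" pattern described above can be sketched as a small wrapper. All names here are hypothetical; a real version would wrap `org.apache.accumulo.core.data.Mutation` and call `context.write()` inside the flush callback, whereas this self-contained sketch uses a `Consumer` as a stand-in for the MapReduce context.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the "ContextFlushingMutation" idea: count puts and
// flush to the context every N puts, starting a fresh batch so that no
// single Mutation grows without bound.
public class ContextFlushingWriter {
    private final int flushEvery;
    private final Consumer<List<String>> flush; // stand-in for context.write
    private List<String> batch = new ArrayList<>();

    public ContextFlushingWriter(int flushEvery, Consumer<List<String>> flush) {
        this.flushEvery = flushEvery;
        this.flush = flush;
    }

    public void put(String entry) {
        batch.add(entry);
        if (batch.size() >= flushEvery) {
            flush.accept(batch);       // hand the full batch to the context
            batch = new ArrayList<>(); // start a fresh mutation/batch
        }
    }

    public void close() { // flush any trailing partial batch
        if (!batch.isEmpty()) {
            flush.accept(batch);
            batch = new ArrayList<>();
        }
    }
}
```

In a real mapper the flush callback would do `context.write(table, mutation)` and the wrapper would hold a Mutation rather than a list of strings; the batching logic is the same.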

Sent from my iPhone

On Sep 20, 2012, at 5:29 PM, "Cardon, Tejay E" <> wrote:


Thanks for the quick response.  I’m not seeing any errors in the logger
logs.  I am using native maps, and I left the memory map size at 1GB.  I
assume that’s plenty large if I’m using native maps, right?



*From:* John Vines []
*Sent:* Thursday, September 20, 2012 3:20 PM
*Subject:* EXTERNAL: Re: Failing Tablet Servers

Okay, so we know that you're killing servers, and we know that when you drop
the amount of data down, you have no issues. Two immediate possibilities come
to mind:
1. You modified the tserver opts to give them 10G of memory. Did you also raise
the in-memory map size in accumulo-site.xml, leave it alone, or raise it to
match the 10G? If you raised it and aren't using the native maps, that would be
problematic, since the heap needs space for other purposes as well.
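For reference, the settings discussed in point 1 look roughly like the fragment below. The property names are from the 1.4/1.5-era Accumulo documentation and the values are only illustrative; verify both against your version before changing anything.

```xml
<!-- Illustrative accumulo-site.xml fragment (values are examples only). -->
<property>
  <name>tserver.memory.maps.max</name>
  <!-- In-memory map size. With native maps enabled this memory is
       allocated OUTSIDE the JVM heap; without them it comes out of
       the heap set in the tserver opts. -->
  <value>1G</value>
</property>
<property>
  <name>tserver.memory.maps.native.enabled</name>
  <value>true</value>
</property>
```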

2. You seem to be writing giant rows. Depending on your key/value size, it's
possible to write a row that you cannot send (especially if using a
WholeRowIterator), which can cause a cascading error during log recovery.
Are you seeing any errors in your loggers' logs?


On Thu, Sep 20, 2012 at 5:05 PM, Cardon, Tejay E <> wrote:

I’m seeing some strange behavior on a moderate (30 node) cluster.  I’ve got
27 tablet servers on large Dell servers with 30GB of memory each, and I’ve set
the TServer_OPTS to give them each 10G of memory.  I’m running an ingest
process that uses AccumuloInputFormat in a MapReduce job to write 1,000
rows, with each row containing ~1,000,000 columns across 160,000 families.  The
MapReduce job initially runs quite quickly, and I can see the ingest rate peak
on the monitor page.  However, after about 30 seconds of high ingest, the
ingest rate falls to 0.  The job then stalls out and my map tasks are eventually
killed.  In the end, the MapReduce job fails and I usually end up with between
3 and 7 of my tservers dead.

Inspecting the tserver.err logs shows nothing, even on the nodes that
fail.  The tserver.out log shows a Java OutOfMemoryError, and nothing
else.  I’ve included a zip with the logs from one of the failed tservers
and a second with the logs from the master.  Other than the out of memory
error, I’m not seeing anything that stands out to me.

If I reduce the data size to only 100,000 columns, rather than 1,000,000,
the process takes about 4 minutes and completes without incident.

Am I just ingesting too quickly?
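Some rough arithmetic on the numbers above suggests why the heap fills up. The 100-bytes-per-entry figure is an assumption for illustration, not a measurement from this job.

```java
// Back-of-envelope math (assumed sizes) for the job described above:
// ~1,000,000 columns per row, written as one Mutation per row.
public class IngestMath {
    public static void main(String[] args) {
        long columnsPerRow = 1_000_000L;
        long bytesPerEntry = 100L; // assumed average serialized key/value size
        long bytesPerRowMutation = columnsPerRow * bytesPerEntry;
        // Roughly 95 MB buffered in a single Mutation before context.write;
        // a handful of these in flight can pressure even a 10 GB heap.
        System.out.println(bytesPerRowMutation / (1024 * 1024) + " MB");
    }
}
```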


Tejay Cardon
